Purpose- In recent years Monte-Carlo sampling methods, such as Monte Carlo tree search, have
achieved tremendous success in model-free reinforcement learning. A combination of the so-called
upper confidence bounds policy, which preserves the "exploration vs. exploitation" balance when
selecting actions for sample evaluations, together with massive computing power to store and to
update dynamically a rather large pre-evaluated game tree led to the development of software that
has beaten the top human player in the game of Go on a 9 by 9 board. Much effort in current
research is devoted to widening the range of applicability of the Monte-Carlo sampling methodology
to partially observable Markov decision processes with non-immediate payoffs. The main challenge
introduced by randomness and incomplete information is to deal with the action evaluation at the
chance nodes due to drastic differences in the possible payoffs the same action could lead to. The
aim of this article is to establish a version of a theorem that originated from population genetics
and was later adopted in evolutionary computation theory that will lead to novel Monte-Carlo
sampling algorithms that provably increase the AI potential. Due to space limitations the actual
algorithms themselves will be presented in sequel papers; however, the current paper provides a
solid mathematical foundation for the development of such algorithms and explains why they are so
promising.

Design/Methodology/Approach- In the current paper we set up a mathematical framework, state and
prove a version of a Geiringer-like theorem that is very well-suited for the development of
Monte-Carlo sampling algorithms to cope with randomness and incomplete information to make
decisions. From the framework it will be clear that such algorithms increase what seems like a
limited sample of rollouts exponentially in size by exploiting the symmetry within the state space
at little or no additional computational cost. Appropriate notions of recombination (or crossover)
and schemata are introduced to stay in line with the traditional evolutionary computation
terminology. The main theorem is proved using the methodology developed in the PhD thesis of the
first author; however, the general case of non-homologous recombination presents additional
challenges that have been overcome thanks to a lovely application of the classical and elementary
tool known as the "Markov inequality" together with the lumping quotients of Markov chains
techniques developed and successfully applied by the authors in previous research for different
purposes. This methodology will be mildly extended to establish the main result of the current
article. In addition to establishing the Geiringer-like theorem for Monte Carlo sampling, which is
the central objective of this paper, we also strengthen the applicability of the core theorem from
the PhD thesis of the first author on which our main result rests. This provides additional
theoretical justification for the anticipated success of the presented theory.

Findings- This work establishes an important theoretical link between classical population genetics,
evolutionary computation theory and model-free reinforcement learning methodology. Not only may the
theory explain the success of the currently existing Monte-Carlo tree sampling methodology, but
it also leads to the development of novel Monte-Carlo sampling techniques guided by a rigorous
mathematical foundation.

Practical implications- The theoretical foundations established in the current work provide guidance 
for the design of powerful Monte-Carlo sampling algorithms in model free reinforcement learning to 
tackle numerous problems in computational intelligence. 

Originality/value- Establishing a Geiringer-like theorem with non-homologous recombination was a
long-standing open problem in evolutionary computation theory. Apart from overcoming this
challenge in a mathematically elegant fashion and establishing a rather general and powerful
version of the theorem, this work leads directly to the development of novel provably powerful
algorithms for decision making in environments involving randomness and hidden or incomplete
information.

Keywords: Reinforcement learning; partially observable Markov decision processes; Monte Carlo tree
search; upper confidence bounds for trees; evolutionary computation; Geiringer Theorem; schemata;
non-homologous recombination (crossover); Markov chains; lumping quotients of Markov chains;
Markov inequality; contraction mapping principle; irreducible Markov chains; non-homogeneous
Markov chains.



1. Introduction 

A great number of questions in machine learning, computer game intelligence, control the- 
ory, and numerous other applications involve the design of algorithms for decision-making 
by an agent under a specified set of circumstances. In the most general setting, the problem
can be described mathematically in terms of state and action pairs as follows. A
state-action pair is an ordered pair of the form $(s, \vec{\alpha})$ where
$\vec{\alpha} = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ is
the set of actions (or moves, in case the agent is playing a game, for instance) that the
agent is capable of taking when it is in the state (or, in case of a game, a state might be
sometimes referred to as a position) $s$. Due to randomness, hidden features, lack of memory,
limitations of the sensor capabilities etc., the state may be only partially observable by
the agent. Mathematically this means that there is a function $\phi : S \to O$ (as a matter of
fact, a random variable with respect to the unknown probability space structure on the set $S$)
where $S$ is the set of all states, which could be either finite or infinite, while $O$ is the set
(usually finite due to memory limitations) of observations having the property that whenever
$\phi(s_1) = \phi(s_2)$ (i.e. whenever the agent cannot distinguish states $s_1$ and $s_2$) then the
corresponding state-action pairs $(s_1, \vec{\alpha}_1)$ and $(s_2, \vec{\alpha}_2)$ are such that
$\vec{\alpha}_1 = \vec{\alpha}_2$ (i.e. the agent
knows which actions it can possibly take based only on the observation it makes). The general
problem of reinforcement learning is to decide which action is best suited given the
agent's knowledge (that is the observation that the agent has made as well as the agent's 
past experience). In computational settings "suitability" is naturally described in terms of a 
numerical reward value. In the probability theoretic sense the agent aims to maximize the 
expected reward (the expected reward considered as a random variable on the enormous 
and unknown conditional probability space of states given a specific observation and an 
action taken). Most common models such as POMDPs (partially observable Markov deci- 
sion processes) assume that the next state and the corresponding numerical rewards depend 
stochastically only on the current observation and action. In a number of situations the im- 
mediate rewards after executing a single action are unknown. The so-called "model free" 
reinforcement learning methods, such as Monte Carlo techniques (i.e. algorithms based 
on repeated random sampling), are used to tackle problems of this type. In such algorithms
a large number of rollouts (i.e. simulations or self-plays) are made and actions
are assigned numerical payoffs that get updated dynamically (i.e. at every simulation of an
algorithm). While the simulated self-plays started with a specific chosen action, say a, are 
entirely random, the action a itself is chosen with respect to a dynamically updated proba- 
bility distribution which ensures the exploration versus exploitation balance: the technique 
known as UCB (Upper Confidence Bounds). It may be worth emphasizing that the UCB
methodology is based on a solid mathematical foundation (see HI, ITOl and IS). A combination
of UCB with Monte Carlo sampling led to a tremendous breakthrough in the level of computer
Go performance (see [5] and [6], for instance) and much research is currently under way
to widen the applicability of the method. Some of the particularly challenging and
interesting directions involve decision making in the environments (or games) involving 
randomness, hidden information and uncertainty or in "continuous" environments where 
appropriate similarities on the set of states must be constructed due to runtime and memory
limitations, and where action evaluation policies must be enhanced to cope with drastic
changes in the payoffs as well as an enormous combinatorial explosion in the branching
factor of the decision tree. In recent years a number of heuristic approaches have been
proposed based on the existing probabilistic planning methodology. Although some of these
newly developed methods have already achieved surprisingly powerful performance levels
(see [23] and [24]), the authors believe there is still room for drastic improvement based on
the rigorous mathematical theory that originated in classical population genetics (fSl) and
was later adopted in traditional evolutionary computation theory ([18], [13], [12]). Theorems of
this type are known as Geiringer-like results and they address the limiting "frequency of 
occurrence" of various sets of "genes" as recombination is repeatedly applied over time. 
The main objective of the current work is to establish a rather general and powerful version 
of a Geiringer-like theorem with "non-homologous" recombination operators in the setting 
of Monte Carlo sampling. This theorem leads to simple dynamic algorithms that exploit 
the intrinsic similarity within the space of observations to increase exponentially the size 
of the already existing sample of rollouts, yielding significantly more informative action
evaluation at very little or even no additional computational cost at all. The details of how
this is done will be described in sections 3 and 4. Due to space limitations, the actual
algorithms will appear in sequel papers. As a matter of fact, we believe the interested readers
may actually design such algorithms on their own after studying sections 3 and 4.
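
To make the interplay between random rollouts and UCB-based action selection described in this section concrete, here is a minimal Python sketch. It assumes the standard UCB1 score (empirical mean plus an exploration bonus), which may differ from the exact variant used in the cited works; the names `ucb1_select`, `evaluate_actions` and `toy_rollout`, as well as the payoff probabilities, are hypothetical and chosen only for illustration.

```python
import math
import random


def ucb1_select(counts, totals, c=math.sqrt(2)):
    """Pick the action index with the largest UCB1 score (assumed selection rule)."""
    # Every action is tried once before the formula is applied.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    plays = sum(counts)
    scores = [
        totals[i] / counts[i] + c * math.sqrt(math.log(plays) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def evaluate_actions(rollout, n_actions, n_rollouts, seed=0):
    """Run rollouts, choosing the first action by UCB1 and updating payoffs dynamically."""
    rng = random.Random(seed)
    counts = [0] * n_actions
    totals = [0.0] * n_actions
    for _ in range(n_rollouts):
        a = ucb1_select(counts, totals)
        counts[a] += 1
        totals[a] += rollout(a, rng)  # payoff of one random simulation started with a
    return counts, totals


def toy_rollout(action, rng):
    """Hypothetical stand-in for a random self-play: a Bernoulli payoff per action."""
    return 1.0 if rng.random() < (0.2, 0.8)[action] else 0.0


counts, totals = evaluate_actions(toy_rollout, n_actions=2, n_rollouts=2000)
```

In this toy setting the action with the higher expected payoff ends up being sampled far more often, while the weaker action still receives occasional rollouts; that is the exploration versus exploitation balance the text refers to.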

2. Overview 

Due to the interdisciplinary nature of this work the authors did their best to make the paper
accessible on various levels to a potentially wide audience having diverse backgrounds
and research interests ranging from practical software engineering to applied mathematics,
theoretical computer science and high-level algorithm design based on a solid mathematical
foundation. The next section (section 3) is essential for understanding the main idea of the
paper. It provides the notation and sets up a rigorous mathematical framework, while the
informal comments motivating the various notions introduced assist the reader's comprehension.
Section 4 contains all the necessary definitions and concepts required to state and
to explain the results of the article. It ends with the statement of the Geiringer-like theorem
aimed at applications to decision making in environments with randomness and incomplete
information where no immediate rewards are available. This is the central aim of the
paper. A reader who is only after a calculus-level understanding with the aim of developing
applications within an appropriate area of software engineering may be satisfied reading
section 4 and finishing their study at this point. Section 5 is devoted to establishing and
deriving the main results of the article in a mathematically rigorous fashion. Clearly this
is fundamentally important for understanding where these results come from and how one
may modify them as needed. We strongly encourage all interested readers to attempt
to understand the entire section 5. Subsection 5.1 does require familiarity with elementary
group theory. A number of textbooks on this subject are available (see, for instance, Q)
but all of them contain far more material than necessary to understand our work. To get
the minimal necessary understanding, the reader is invited to look at the previous papers
on finite-population Geiringer theorems of the first two authors: [13] and [12]. Finally,
section 6 is included only for the sake of strengthening the general finite-population Geiringer
theorem to emphasize its validity for non-homogeneous time Markov chains, namely
theorem 23. Example 24 explains why this is of interest for the algorithm development. The
material in section 6 is entirely independent of the rest of the paper. One could read it either
at the beginning or at the end. The authors suspect this theory is known in modern mathematics,
but literature emphasizing theorems |72] and |8T| is virtually impossible to locate. Moreover,
the mathematics behind these theorems is classical, general, simple and elegant. While
section 6 is probably not of any interest to software engineers (theorem 23 may be thought
of as strengthening the justification of the main ideas), a more mathematically inclined audience
will find it enjoyable and easy to read.

3. Equivalence/Similarity Relation on the States

Let $S$ denote the set of states (enormous but finite in this framework). Formally each state
$\vec{s} \in S$ is an ordered pair $(s, \vec{\alpha})$ where $\vec{\alpha}$ is the set of actions
an agent can possibly take when in the state $s$. Let $\sim$ be an equivalence relation on $S$.
Without loss of generality we will denote every equivalence class by an integer
$1, 2, \ldots, i, \ldots \in \mathbb{N}$ so that each element
of $S$ is an ordered pair $(i, a)$ where $i \in \mathbb{N}$ and $a \in A$ with $A$ being some
finite alphabet. With this notation $(i, a) \sim (j, b)$ iff $i = j$. Intuitively, $S$ is the set
of states and $\sim$ is the similarity relation on the states. For example, in a card game, if the
2 states corresponding to the same player have cards of roughly equivalent value (for that
specific game) and their opponent's cards are unknown (and there might be some more hidden and
random effects) then the 2 states will be considered equivalent under $\sim$. We will also require
that for two equivalent states $\vec{s}_1 = (s_1, \vec{\alpha}_1)$ and
$\vec{s}_2 = (s_2, \vec{\alpha}_2)$ under $\sim$ there are bijections
$f_1 : \vec{\alpha}_1 \to \vec{\alpha}_2$ and $f_2 : \vec{\alpha}_2 \to \vec{\alpha}_1$. For the
time being, these bijections should be obvious from the representation of the environment (and
actions) and reflect the similarity between these actions.

Remark 1. In theory we want the functions $f_1$ and $f_2$ to be bijections and inverses of one
another for the theoretical model to be perfectly rigorous, but in practice there should probably
be no strict requirement on that. In fact, we believe that in practice one may even want
to relax the assumption that $\sim$ is an equivalence relation.
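
The conventions of this section translate directly into a small data structure. The following Python sketch is only an illustration: the names `State`, `equivalent` and `are_mutually_inverse`, and the sample action names, are hypothetical and do not appear in the paper. A state is stored as a pair $(i, a)$, equivalence is equality of class labels, and the bijections $f_1$, $f_2$ between action sets are plain dictionaries.

```python
from typing import NamedTuple


class State(NamedTuple):
    cls: int      # equivalence class label i (a natural number)
    letter: str   # letter a from the finite alphabet A, distinguishing states in a class


def equivalent(s: State, t: State) -> bool:
    """(i, a) ~ (j, b) iff i = j."""
    return s.cls == t.cls


def are_mutually_inverse(f1: dict, f2: dict) -> bool:
    """Check that f1 and f2 are bijections that are inverses of one another."""
    return (all(f2.get(f1[a]) == a for a in f1)
            and all(f1.get(f2[b]) == b for b in f2))


# Hypothetical action sets of two equivalent states and the maps between them.
f1 = {"raise": "bet", "fold": "pass"}
f2 = {"bet": "raise", "pass": "fold"}
```

Relaxing the requirement mentioned in Remark 1 would correspond to dropping the `are_mutually_inverse` check in such a representation.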

As described in sections 1 and 2, the most challenging question when applying an MCT
type of algorithm to deal with randomness and incomplete information, or simply with a
large branching factor of the game tree, is to evaluate the actions under consideration making
the most out of the sample of independent rollouts. Quite surprisingly, very powerful
programs have already been developed and tested in practice against human players (see
[11]); however, the action-evaluation algorithms used in this software are purely heuristic
and no theoretical foundation is presented to explain their success. In fact, most of these
methods use some kind of a voting mechanism to deal with rather weak classifiers. In the
next section we will set the stage to state the main result of this paper, which motivates
new algorithms for evaluating actions (or moves) at the chance nodes and hopefully will
provide some understanding of the success of the already existing techniques in future
research.

4. Mathematical Framework, Notion of Crossover/Recombination and 
Statement of the Finite Population Geiringer Theorem for Action 
Evaluation. 

Definition 2. Suppose we are given a chance node $\vec{s} = (s, \vec{\alpha})$ and a sequence
$\{\alpha_i\}_{i=1}^{b}$ of actions in $\vec{\alpha}$ (it is possible that $\alpha_i = \alpha_j$
for $i \neq j$). We may then call $\vec{s}$ a root state, or a state in question, the sequence
$\{\alpha_i\}_{i=1}^{b}$ the sequence of moves (actions) under evaluation, and the set of moves
$\mathcal{A} = \{\alpha \mid \alpha = \alpha_i \text{ for some } i \text{ with } 1 \leq i \leq b\}$
the set of actions (or moves) under evaluation.

Definition 3. A rollout with respect to the state in question $\vec{s} = (s, \vec{\alpha})$ and an
action $\alpha \in \vec{\alpha}$ is a sequence of states following the action $\alpha$ and ending
with a terminal label $f \in \Sigma$, where $\Sigma$ is an arbitrary set of labels^a, which looks
as $(\alpha, s_1, s_2, \ldots, s_{t-1}, f)$. For technical reasons which will become obvious later
we will also require that $s_i \neq s_j$ for $i \neq j$ (it is possible and common to have
$s_i \sim s_j$ though). We will say that the total number of states in a rollout (which is $t - 1$
in the notation of this definition) is the height of the rollout.

^a Intuitively, each terminal label in the set $\Sigma$ represents a terminal state that we can assign a numerical value to

Remark 4. Notice that in definition 3 we included only the initial move $\alpha$ made at the
state in question (see definition 2), which is the move under evaluation (see definition 2).
The moves between the intermediate states are chosen randomly and are not evaluated so
that there is no reason to consider them.

Remark 5. In section 3 we have introduced a convenient notation for states to emphasize
their respective equivalence classes. With such notation a typical rollout would appear
as a sequence $(\alpha, (i_1, a_1), (i_2, a_2), \ldots, (i_{t-1}, a_{t-1}), f)$ with
$i_j \in \mathbb{N}$ while $a_j \in A$. According to the requirement in definition 3,
$i_j = i_k$ for $j \neq k \Longrightarrow a_j \neq a_k$.

A single rollout provides rather little information about an action, particularly due to the
combinatorial explosion in the branching factor of possible moves of the player and the
opponents. Normally a large number of rollouts, yet one compatible with the total resource
limitations, is performed to evaluate the actions at various positions. The challenging question
which the current work addresses is how one can take full advantage of the parallel sequence of
rollouts. Since the main idea is motivated by the Geiringer theorem, which originated in
population genetics (lUl) and has later also been used in evolutionary computation theory
([18], [13] and [12]), we shall exploit the terminology of the evolutionary computation
community here.

Definition 6. Given a state in question $\vec{s} = (s, \vec{\alpha})$ and a sequence
$\{\alpha_i\}_{i=1}^{b}$ of moves under evaluation (in the sense of definition 2), a population
$P$ with respect to the state $\vec{s} = (s, \vec{\alpha})$ and the sequence
$\{\alpha_i\}_{i=1}^{b}$ is a sequence of rollouts $P = \{r_i^{l(i)}\}_{i=1}^{b}$ where
$r_i = (\alpha_i, s_1^i, s_2^i, \ldots, s_{l(i)-1}^i, f_i)$. Just as in definition 3 we will
assume that $s_k^i \neq s_q^j$ whenever $i \neq j$ (which, in accordance with definition 3, is as
strong as requiring that $s_k^i \neq s_q^j$ whenever $i \neq j$ or $k \neq q$)^b. Moreover, we also
assume that the terminal labels $f_i$ are all distinct within a population^c. A population in
which $s_k^i \sim s_q^j$ implies $k = q$ is called homologous. Loosely speaking, a homologous
population is one where equivalent states cannot appear at different "heights".

Remark 7. Each rollout $r_i^{l(i)}$ in definition 6 is started with the corresponding move
$\alpha_i$ of the sequence of moves under evaluation (see definition 2). It is clear that if one
were to permute the rollouts without changing the actual sequences of states the corresponding
populations should provide identical values for the corresponding actions under evaluation.
In fact, most authors in evolutionary computation theory (see [22], for instance) do assume
that such populations are equivalent and deal with the corresponding equivalence classes of
multisets corresponding to the individuals (these are sequences of rollouts). Nonetheless,
when dealing with finite-population Geiringer-like theorems it is convenient, for technical
reasons which will become clear when the proof is presented (see also [13] and [12]), to
assume the ordered multiset model, i.e. the populations are considered formally distinct
when the individuals are permuted. Incidentally, ordered multiset models are useful for
other types of theoretical analysis in [19] and [20].

via a function $\phi : \Sigma \to \mathbb{Q}$. The reason we introduce the set $\Sigma$ of formal labels, as opposed to requiring that each
terminal label is a rational number straight away, is to avoid confusion in the upcoming definitions.

^b The last assumption that all the states in a population are formally distinct (although they may be equivalent)
will be convenient later to extend the crossover operators from pairs to entire populations. This assumption
does make sense from the intuitive point of view as well since the exact state in most games involving randomness
or incomplete information is simply unknown.

^c This assumption does not reduce any generality since one can choose an arbitrary (possibly many to one)
assignment function $\phi : \Sigma \to \mathbb{Q}$, yet the complexity of the statements of our main theorems will be mildly
alleviated.

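Example 8. A typical population with the convention as in remark 5 might look as below.

$$
\begin{array}{l}
\alpha \mapsto 1a \mapsto 5a \mapsto 6a \mapsto 3d \mapsto 7a \mapsto f_1 \\
\beta \mapsto 2a \mapsto 1b \mapsto 3c \mapsto 6d \mapsto f_2 \\
\gamma \mapsto 4a \mapsto 6b \mapsto 5b \mapsto f_3 \\
\alpha \mapsto 1c \mapsto 4b \mapsto 2b \mapsto 7b \mapsto 5c \mapsto f_4 \\
\xi \mapsto 3a \mapsto 2c \mapsto 4c \mapsto f_5 \\
\zeta \mapsto 2d \mapsto f_6 \\
\pi \mapsto 3b \mapsto 1d \mapsto 2e \mapsto 6c \mapsto f_7
\end{array}
$$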

The height of the first rollout in the population pictured above would then be 5 since it 
contains 5 states. The reader can easily see that the heights of the rollouts in this population 
read from top to bottom are 5, 4, 3, 5, 3, 1 and 4 respectively. Clearly, the total number of 
states within the population is the sum of the heights of all the rollouts in the population. 
In fact, this very simple observation is rather valuable when establishing the main result of 
the current article, as will become clear in subsection 5.4 of section 5.
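
The bookkeeping in this example is easy to check mechanically. Below is a small Python sketch, offered only as an illustration: the helper names `parse_rollout` and `height`, the text encoding of rollouts and the spelled-out action names are assumptions made here, not notation from the paper. It encodes the population above, computes the heights of the rollouts, and confirms that all states are formally distinct.

```python
def parse_rollout(text):
    """Parse 'alpha 1a 5a f1' into (action, ((1, 'a'), (5, 'a')), 'f1')."""
    action, *states, label = text.split()
    return action, tuple((int(s[:-1]), s[-1]) for s in states), label


# The population of seven rollouts pictured in the example above.
POPULATION = [parse_rollout(line) for line in (
    "alpha 1a 5a 6a 3d 7a f1",
    "beta 2a 1b 3c 6d f2",
    "gamma 4a 6b 5b f3",
    "alpha 1c 4b 2b 7b 5c f4",
    "xi 3a 2c 4c f5",
    "zeta 2d f6",
    "pi 3b 1d 2e 6c f7",
)]


def height(rollout):
    """Number of states in a rollout (the action and terminal label are not counted)."""
    return len(rollout[1])


heights = [height(r) for r in POPULATION]
all_states = [s for _, states, _ in POPULATION for s in states]
total_states = sum(heights)  # equals len(all_states)
```

Here `heights` reproduces the list 5, 4, 3, 5, 3, 1, 4 stated above, and the total number of states is their sum.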

The main idea is that the random actions taken at the equivalent states should be
interchangeable since they are chosen at random during the simulation stage of the
MCT algorithm. In the language of evolutionary computing, such a swap of moves is called
a crossover. Due to randomness or incomplete information (together with the equivalence
relation, which can be defined using the expert knowledge of a specific game being
analyzed), in order to obtain the most out of a sample (population in our language) of the
parallel rollouts it is desirable to explore all possible populations obtained by making
various swaps of the corresponding rollouts at the equivalent positions. Computationally this
task seems expensive if one were to run the type of genetic programming described
precisely below, yet it turns out that we can predict exactly what the limiting outcome of this
"mixing procedure" would be^d. We now continue with the rigorous definitions of crossover.

The representation of rollouts suggested in remark 5 is convenient for defining crossover
operators for two given rollouts. We will introduce two crossover operations below.

^d In this paper we will need to "inflate" the population first and then take the limit of a sequence of these limiting
procedures as the inflation factor increases. All of this will be rigorously presented and discussed in subsection 4.2
and in section 5.

Definition 9. Given two rollouts
$r_1 = (\alpha_1, (i_1, a_1), (i_2, a_2), \ldots, (i_{t(1)-1}, a_{t(1)-1}), f)$ and
$r_2 = (\alpha_2, (j_1, b_1), (j_2, b_2), \ldots, (j_{t(2)-1}, b_{t(2)-1}), g)$ of lengths $t(1)$
and $t(2)$ respectively that share no state in common (i.e., as in definition 5) there are
two (non-homologous) crossover (or recombination) operators we introduce here. For
an equivalence class label $m \in \mathbb{N}$ and letters $c, d \in A$ define the one-point
non-homologous crossover transformation $\chi_{m,c,d}(r_1, r_2) = (t_1, t_2)$ where

$$t_1 = (\alpha_1, (i_1, a_1), (i_2, a_2), \ldots, (i_{k-1}, a_{k-1}), (j_q, b_q), (j_{q+1}, b_{q+1}), \ldots, (j_{t(2)-1}, b_{t(2)-1}), g)$$

and

$$t_2 = (\alpha_2, (j_1, b_1), (j_2, b_2), \ldots, (j_{q-1}, b_{q-1}), (i_k, a_k), (i_{k+1}, a_{k+1}), \ldots, (i_{t(1)-1}, a_{t(1)-1}), f)$$

if [$i_k = j_q = m$ and either ($a_k = c$ and $b_q = d$) or ($a_k = d$ and $b_q = c$)], and
$(t_1, t_2) = (r_1, r_2)$ otherwise.

Likewise, we introduce a single position swap crossover $\nu_{m,c,d}(r_1, r_2) = (v_1, v_2)$
where

$$v_1 = (\alpha_1, (i_1, a_1), (i_2, a_2), \ldots, (i_{k-1}, a_{k-1}), (j_q, b_q), (i_{k+1}, a_{k+1}), \ldots, (i_{t(1)-1}, a_{t(1)-1}), f)$$

while

$$v_2 = (\alpha_2, (j_1, b_1), (j_2, b_2), \ldots, (j_{q-1}, b_{q-1}), (i_k, a_k), (j_{q+1}, b_{q+1}), \ldots, (j_{t(2)-1}, b_{t(2)-1}), g)$$

if [$i_k = j_q = m$ and either ($a_k = c$ and $b_q = d$) or ($a_k = d$ and $b_q = c$)], and
$(v_1, v_2) = (r_1, r_2)$ otherwise. In addition, a single swap crossover is defined not only on
pairs of rollouts but also on a single rollout, swapping equivalent states in the analogous
manner: if

$$r = (\alpha, (i_1, a_1), (i_2, a_2), \ldots, (i_{j-1}, a_{j-1}), (i_j, a_j), (i_{j+1}, a_{j+1}), \ldots, (i_{k-1}, a_{k-1}), (i_k, a_k), (i_{k+1}, a_{k+1}), \ldots, (i_{t(1)-1}, a_{t(1)-1}), f)$$

and [$i_j = i_k = m$ and either ($a_j = c$ and $a_k = d$) or ($a_j = d$ and $a_k = c$)] then

$$\nu_{m,c,d}(r) = (\alpha, (i_1, a_1), (i_2, a_2), \ldots, (i_{j-1}, a_{j-1}), (i_j, a_k), (i_{j+1}, a_{j+1}), \ldots, (i_{k-1}, a_{k-1}), (i_k, a_j), (i_{k+1}, a_{k+1}), \ldots, (i_{t(1)-1}, a_{t(1)-1}), f)$$

and, of course, $\nu_{m,c,d}(r)$ fixes $r$ (i.e. $\nu_{m,c,d}(r) = r$) otherwise.

Remark 10. Notice that definition 9 makes sense thanks to the assumption that no rollout
contains an identical pair of states in definition 3.

Remark 11. Intuitively, performing one point crossover means that the corresponding 
player might have changed their strategy in a similar situation due to randomness and a 
single swap crossover corresponds to the player not knowing the exact state they are in due 
to incomplete information, for instance. 
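
The two pairwise operators of definition 9 can be sketched in a few lines of Python. This is an illustrative reading of the definition under an assumed encoding, not the authors' implementation: a rollout is a tuple `(action, states, label)` with `states` a tuple of `(class, letter)` pairs, and the function names are hypothetical. The sample rollouts `R4` and `R7` are the fourth and seventh rollouts of example 8.

```python
def one_point_crossover(r1, r2, m, c, d):
    """chi_{m,c,d}: exchange the tails of r1 and r2 starting at the matched states."""
    (a1, s1, f), (a2, s2, g) = r1, r2
    for x, y in ((c, d), (d, c)):
        if (m, x) in s1 and (m, y) in s2:
            k, q = s1.index((m, x)), s2.index((m, y))
            # The terminal labels travel together with the exchanged tails.
            return (a1, s1[:k] + s2[q:], g), (a2, s2[:q] + s1[k:], f)
    return r1, r2  # no matching pair of states: the pair is left unchanged


def single_swap_crossover(r1, r2, m, c, d):
    """nu_{m,c,d} on a pair: exchange only the two matched states."""
    (a1, s1, f), (a2, s2, g) = r1, r2
    for x, y in ((c, d), (d, c)):
        if (m, x) in s1 and (m, y) in s2:
            k, q = s1.index((m, x)), s2.index((m, y))
            return ((a1, s1[:k] + (s2[q],) + s1[k + 1:], f),
                    (a2, s2[:q] + (s1[k],) + s2[q + 1:], g))
    return r1, r2


def single_swap_within(r, m, c, d):
    """nu_{m,c,d} on one rollout: interchange (m, c) and (m, d) if both occur in it."""
    a, s, f = r
    if c != d and (m, c) in s and (m, d) in s:
        states = list(s)
        j, k = states.index((m, c)), states.index((m, d))
        states[j], states[k] = states[k], states[j]
        return a, tuple(states), f
    return r


R4 = ("alpha", ((1, "c"), (4, "b"), (2, "b"), (7, "b"), (5, "c")), "f4")
R7 = ("pi", ((3, "b"), (1, "d"), (2, "e"), (6, "c")), "f7")
T1, T2 = one_point_crossover(R4, R7, 1, "c", "d")
V1, V2 = single_swap_crossover(R4, R7, 1, "c", "d")
```

Because no rollout contains the same state twice, `index` locates each matched state unambiguously, which is the point made in remark 10.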

Just as in the case of defining crossover operators for pairs of rollouts, thanks to the assumption
that all the states in a population of rollouts are formally distinct (see definition 6), it is easy
to extend definition 9 to entire populations of rollouts. In view of remark 11, to get the
most informative picture out of the sequence of parallel rollouts one would want to run the
genetic programming routine without selection and mutation, using only the crossover
operators specified above, for as long as possible and then, in order to evaluate a certain
move $\alpha$, collect the weighted average of the terminal values (i.e. the values assigned to the
terminal labels via some rational-valued assignment function) of all the rollouts starting
with the move $\alpha$ which ever occurred in the process. We now describe precisely what the
process is and give an example.

Definition 12. Given a population $P$ and a transformation of the form $\chi_{i,x,y}$, there exists
at most one pair of distinct rollouts in the population $P$, namely the pair of rollouts $r_1$
and $r_2$ such that the state $(i, x)$ appears in $r_1$ and the state $(i, y)$ appears in $r_2$.
If such a pair exists, then we define the recombination transformation $\chi_{i,x,y}(P) = P'$ where
$P'$ is the population obtained from $P$ by replacing the pair of rollouts $(r_1, r_2)$ with the
pair $\chi_{i,x,y}(r_1, r_2)$ as in definition 9. In any other case we do not make any change, i.e.
$\chi_{i,x,y}(P) = P$. The transformation $\nu_{i,x,y}(P)$ is defined in an entirely analogous
manner with one more amendment: if the states $(i, x)$ and $(i, y)$ appear within the same
individual (rollout), call it

$$r = (\alpha, (j_1, a_1), (j_2, a_2), \ldots, (i, x), \ldots, (i, y), \ldots, (j_{t(1)-1}, a_{t(1)-1}), f),$$

and the state $(i, x)$ precedes the state $(i, y)$, then these states are interchanged obtaining
the new rollout

$$r' = (\alpha, (j_1, a_1), (j_2, a_2), \ldots, (i, y), \ldots, (i, x), \ldots, (j_{t(1)-1}, a_{t(1)-1}), f).$$

Of course, it could be that the state $(i, y)$ precedes the state $(i, x)$ instead, in which case
the definition would be analogous: if

$$r = (\alpha, (j_1, a_1), (j_2, a_2), \ldots, (i, y), \ldots, (i, x), \ldots, (j_{t(1)-1}, a_{t(1)-1}), f)$$

then replace the rollout $r$ with the rollout

$$r' = (\alpha, (j_1, a_1), (j_2, a_2), \ldots, (i, x), \ldots, (i, y), \ldots, (j_{t(1)-1}, a_{t(1)-1}), f).$$

Remark 13. It is very important for the main theorem of our paper that each of the
crossover transformations $\chi_{i,x,y}$ and $\nu_{i,x,y}$ is a bijection on their common domain,
that is, the set of all populations of rollouts at the specified chance node. As a matter of fact,
the reader can easily verify by direct computation from definitions 12 and 9 that each of
the transformations $\chi_{i,x,y}$ and $\nu_{i,x,y}$ is an involution on its domain, i.e.
$\forall\, i, x, y$ we have $\chi_{i,x,y}^2 = \nu_{i,x,y}^2 = \mathbf{1}$ where $\mathbf{1}$ is
the identity transformation.
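
The involution property can also be checked numerically on a concrete population. The Python sketch below is an illustration under assumptions: populations are lists of `(action, states, label)` tuples, the helpers `parse`, `locate`, `chi` and `nu` are hypothetical names, and the population `P` is the one from example 8. It implements the population-level operators of definition 12 and applies each of them twice for every choice of class label and pair of letters.

```python
def parse(text):
    """Parse 'alpha 1c 4b f4' into (action, ((1, 'c'), (4, 'b')), 'f4')."""
    action, *states, label = text.split()
    return action, tuple((int(s[:-1]), s[-1]) for s in states), label


def locate(population, state):
    """Return (rollout index, position) of a state, or None if it is absent."""
    for r, (_, states, _) in enumerate(population):
        if state in states:
            return r, states.index(state)
    return None


def chi(population, i, x, y):
    """One-point recombination chi_{i,x,y} acting on a whole population."""
    p, q = locate(population, (i, x)), locate(population, (i, y))
    if p is None or q is None or p[0] == q[0]:
        return list(population)  # no pair of distinct rollouts: nothing changes
    (a1, s1, f1), (a2, s2, f2) = population[p[0]], population[q[0]]
    new = list(population)
    new[p[0]] = (a1, s1[:p[1]] + s2[q[1]:], f2)
    new[q[0]] = (a2, s2[:q[1]] + s1[p[1]:], f1)
    return new


def nu(population, i, x, y):
    """Single-position swap nu_{i,x,y}; it also acts inside a single rollout."""
    p, q = locate(population, (i, x)), locate(population, (i, y))
    if p is None or q is None or p == q:
        return list(population)
    new = [(a, list(s), f) for a, s, f in population]
    new[p[0]][1][p[1]], new[q[0]][1][q[1]] = (i, y), (i, x)
    return [(a, tuple(s), f) for a, s, f in new]


P = [parse(t) for t in (
    "alpha 1a 5a 6a 3d 7a f1", "beta 2a 1b 3c 6d f2", "gamma 4a 6b 5b f3",
    "alpha 1c 4b 2b 7b 5c f4", "xi 3a 2c 4c f5", "zeta 2d f6",
    "pi 3b 1d 2e 6c f7",
)]

# Apply every operator twice; an involution must return the original population.
is_involution = all(
    op(op(P, i, x, y), i, x, y) == P
    for op in (chi, nu)
    for i in range(1, 8)
    for x in "abcde"
    for y in "abcde"
)
```

This check covers a single population only, so it illustrates the remark rather than proving it; the general statement follows from the direct computation mentioned above.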

Examples below illustrate the important extension of recombination operators to arbitrary 
populations pictorially. 

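Example 14. Suppose we were to apply the recombination (crossover) operator $\chi_{1,c,d}$ to
the population of seven rollouts in example 8. Once the unique location of states $(1, c)$ and
$(1, d)$ in the population has been identified (the first state in the fourth rollout and the
second state in the seventh rollout), applying the crossover operator $\chi_{1,c,d}$ yields the
population pictured below:

$$
\begin{array}{l}
\alpha \mapsto 1a \mapsto 5a \mapsto 6a \mapsto 3d \mapsto 7a \mapsto f_1 \\
\beta \mapsto 2a \mapsto 1b \mapsto 3c \mapsto 6d \mapsto f_2 \\
\gamma \mapsto 4a \mapsto 6b \mapsto 5b \mapsto f_3 \\
\alpha \mapsto 1d \mapsto 2e \mapsto 6c \mapsto f_7 \\
\xi \mapsto 3a \mapsto 2c \mapsto 4c \mapsto f_5 \\
\zeta \mapsto 2d \mapsto f_6 \\
\pi \mapsto 3b \mapsto 1c \mapsto 4b \mapsto 2b \mapsto 7b \mapsto 5c \mapsto f_4
\end{array}
$$

On the other hand, applying the crossover transformation $\nu_{1,c,d}$ to the population in
example 8 results in the population below:

$$
\begin{array}{l}
\alpha \mapsto 1a \mapsto 5a \mapsto 6a \mapsto 3d \mapsto 7a \mapsto f_1 \\
\beta \mapsto 2a \mapsto 1b \mapsto 3c \mapsto 6d \mapsto f_2 \\
\gamma \mapsto 4a \mapsto 6b \mapsto 5b \mapsto f_3 \\
\alpha \mapsto 1d \mapsto 4b \mapsto 2b \mapsto 7b \mapsto 5c \mapsto f_4 \\
\xi \mapsto 3a \mapsto 2c \mapsto 4c \mapsto f_5 \\
\zeta \mapsto 2d \mapsto f_6 \\
\pi \mapsto 3b \mapsto 1c \mapsto 2e \mapsto 6c \mapsto f_7
\end{array}
$$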

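Example 15. Consider now the population $Q$ pictured below:

$$
\begin{array}{l}
\alpha \mapsto 1b \mapsto 3c \mapsto 6d \mapsto f_2 \\
\beta \mapsto 2b \mapsto 7b \mapsto 5c \mapsto f_4 \\
\gamma \mapsto 4a \mapsto 6b \mapsto 5a \mapsto 6a \mapsto 3d \mapsto 7a \mapsto f_1 \\
\alpha \mapsto 1d \mapsto 2c \mapsto 4c \mapsto f_5 \\
\xi \mapsto 3a \mapsto 2e \mapsto 6c \mapsto f_7 \\
\zeta \mapsto 2d \mapsto f_6 \\
\pi \mapsto 3b \mapsto 1c \mapsto 4b \mapsto 2a \mapsto 1a \mapsto 5b \mapsto f_3
\end{array}
$$

Suppose we apply the transformations $\chi_{6,a,b}$ and $\nu_{6,a,b}$ to the population $Q$. The
states $(6, a)$ and $(6, b)$ both appear in the third rollout in the population $Q$. Since these
states appear within the same rollout, according to definition 12 the crossover transformation
$\chi_{6,a,b}$ fixes the population $Q$ (i.e. $\chi_{6,a,b}(Q) = Q$). On the other hand, the
population $\nu_{6,a,b}(Q)$ is pictured below:

$$
\begin{array}{l}
\alpha \mapsto 1b \mapsto 3c \mapsto 6d \mapsto f_2 \\
\beta \mapsto 2b \mapsto 7b \mapsto 5c \mapsto f_4 \\
\gamma \mapsto 4a \mapsto 6a \mapsto 5a \mapsto 6b \mapsto 3d \mapsto 7a \mapsto f_1 \\
\alpha \mapsto 1d \mapsto 2c \mapsto 4c \mapsto f_5 \\
\xi \mapsto 3a \mapsto 2e \mapsto 6c \mapsto f_7 \\
\zeta \mapsto 2d \mapsto f_6 \\
\pi \mapsto 3b \mapsto 1c \mapsto 4b \mapsto 2a \mapsto 1a \mapsto 5b \mapsto f_3
\end{array}
$$
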
Definition 16. Let $\underline{n} = \{1, 2, \ldots, n\}$ denote the set of the first $n$ natural
numbers. Consider any probability distribution $\mu$ on the set of all finite sequences of
crossover transformations

$$\mathcal{F} = \left( \bigcup_{n=1}^{\infty} \left( \{\chi_{i,x,y} \mid x, y \in A \text{ and } i \in \mathbb{N}\} \cup \{\nu_{i,x,y} \mid x, y \in A \text{ and } i \in \mathbb{N}\} \right)^{\underline{n}} \right) \cup \{\mathbf{1}\}$$

which assigns a positive probability to the singleton sequences^e and to the identity element
$\mathbf{1}$ (i.e. to every element of the subset

$$\mathcal{S} = \{\mathbf{1}\} \cup \{\chi_{i,x,y} \mid x, y \in A \text{ and } i \in \mathbb{N}\} \cup \{\nu_{i,x,y} \mid x, y \in A \text{ and } i \in \mathbb{N}\}).$$

Given a sequence of transformations $\Theta = \{\Theta_{i(j),x(j),y(j)}\}_{j=1}^{n}$ where each
$\Theta$ is either $\chi$ or $\nu$ (i.e. $\forall\, j$ either
$\Theta_{i(j),x(j),y(j)} = \chi_{i(j),x(j),y(j)}$ or $\Theta_{i(j),x(j),y(j)} = \nu_{i(j),x(j),y(j)}$),
consider the transformation

$$\widetilde{\Theta} = \Theta_{i(n),x(n),y(n)} \circ \Theta_{i(n-1),x(n-1),y(n-1)} \circ \cdots \circ \Theta_{i(2),x(2),y(2)} \circ \Theta_{i(1),x(1),y(1)}$$

on the set of all populations starting at the specified chance node obtained by composing all
the transformations in the sequence $\Theta$. The identity element $\mathbf{1}$ stands for the
identity map on the set of all possible populations of rollouts. Now define the Markov transition
matrix $M_\mu$ on the set of all populations of rollouts (see definition 6 and remark 5) as follows:
given populations $X$ and $Y$ of the same size $k$, the probability of obtaining the population
$Y$ from the population $X$ after performing a single crossover stage is
$p_{X \to Y} = \mu(\mathcal{S}_{X \to Y})$ where

$$\mathcal{S}_{X \to Y} = \{\Gamma \mid \Gamma \in \mathcal{F} \text{ and } T(\Gamma)(X) = Y\}$$

where

$$T(\Gamma) = \begin{cases} \widetilde{\Theta} & \text{if } \Gamma = \Theta \\ \text{the identity map} & \text{if } \Gamma = \mathbf{1}. \end{cases}$$

Example 17 below illustrates the first part of definition 16.

Example 17. Consider the sequence of five recombination transformations 

= (Xi, c.rfi X2,c,e; X5. a. b: \l. a. b: \2.a.b)- 

According to definition[T6]the sequence 8 gives rise to the composed recombination trans- 
formation 

= X2,a,6 O Xl.a.b ° X5, a.b ° X2. c, e O Xl.c.d- 

The reader may verify as a small exercise that &{P) — Q where P is the population 
displayed on figure ?? while the population Q is the one appearing in figure ??. If one were 
to append the recombination transformation z/g, a, b to the sequence of rollouts 9 obtaining 
the sequence 

ei = (xi,c,<i, 



T{T) 



"This technical assumption may be altered in vaiious manner as long as the induced Markov chain remains 
irreducible. 
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then, by associativity of composition, we have 6i = i^6,a,& o © so that 9i(P) = 
i'Q,a,b{Q{P)) — i'6,a,b{Q) where Q, as above, is the population displayed on figure ?? 
so that, according to example[T5] the population Qi{P) is the one appearing in figure ??. 

Remark 18. Evidently the map $T : \mathcal{F} \to \mathcal{P}^{\mathcal{P}}$ introduced at the end of definition 16 can be regarded as a random variable on the set $\mathcal{F}$ described at the beginning of definition 16, where $\mathcal{P}$ denotes the set of all populations of rollouts containing $k$ individuals, so that $\mathcal{P}^{\mathcal{P}}$ is the set of all endomorphisms (functions with the same domain and codomain) on $\mathcal{P}$, and the probability measure $\mu_T$ on $\mathcal{P}^{\mathcal{P}}$ is the "pushforward" measure induced by $T$, i.e. $\mu_T(S) = \mu(T^{-1}(S))$. To alleviate the complexity of verbal (or written) presentation we will usually abuse the language and use the set $\mathcal{F}$ in place of $\mathcal{P}^{\mathcal{P}}$ so that a transformation $F \in \mathcal{P}^{\mathcal{P}}$ is identified with the entire set $T^{-1}(F) \subseteq \mathcal{F}$. For example,

if we write $\mu(\{F \mid F \in \mathcal{F} \text{ and } F(X) = Y\})$ we mean $\mu(\{\Gamma \mid \Gamma \in \mathcal{F} \text{ and } T(\Gamma)(X) = Y\})$.

It may be worth pointing out that the set $T^{-1}(F)$ is not necessarily a singleton, i.e. the map $T$ is not one-to-one, as example 19 below demonstrates.

Example 19. Consider any $i \neq j$ and any $a, b, c$ and $d \in A$. Notice that the transformations $\nu_{i,a,b}$ and $\nu_{j,c,d}$ commute since the order in which elements of distinct equivalence classes are interchanged within the same population of rollouts is irrelevant. Thus the sequences $x_1 = (\nu_{i,a,b}, \nu_{j,c,d})$ and $x_2 = (\nu_{j,c,d}, \nu_{i,a,b})$ induce exactly the same transformation on the set of populations of rollouts. Here is another very important example. Notice that every transformation $\Theta_{i,a,b}$, where $\Theta$ could be either $\chi$ or $\nu$, is an involution on the set of populations of rollouts, i.e. $\Theta_{i,a,b} \circ \Theta_{i,a,b} = e$ where $e$ is the identity map, since performing a swap at identical positions twice brings back the original population of rollouts. Therefore any ordered pair $(\Theta_{i,a,b}, \Theta_{i,a,b})$ of repeated transformations induces exactly the same transformation as the symbol $\mathbf{1}$, namely the identity transformation on the population of rollouts.
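The two observations of example 19 can be checked on a toy model. The sketch below does not implement the crossover transformations of definitions 12 and 13; it assumes a hypothetical stand-in, `swap_labels(i, a, b)`, that merely exchanges the labels `a` and `b` on every state of equivalence class `i`. That is enough to exhibit an involution and a pair of commuting transformations, and hence two distinct sequences with the same composed map.

```python
# A rollout is (action, ((class, label), ...), terminal); a population is a tuple of rollouts.
POP = (
    ("alpha", ((1, "a"), (2, "c")), "f1"),
    ("beta", ((2, "d"), (1, "b")), "f2"),
)


def swap_labels(i, a, b):
    """Toy stand-in for a primitive transformation (an assumption, not the paper's
    definition): exchange the labels a and b on every state of class i."""
    def relabel(state):
        cls, lab = state
        if cls == i and lab == a:
            return (cls, b)
        if cls == i and lab == b:
            return (cls, a)
        return state

    def transform(pop):
        return tuple(
            (action, tuple(relabel(s) for s in states), terminal)
            for action, states, terminal in pop
        )

    return transform


def compose(sequence):
    """Send a finite sequence (g_1, ..., g_n) to the map g_n o ... o g_1,
    mirroring the role of the map T in definition 16."""
    def composed(pop):
        for g in sequence:
            pop = g(pop)
        return pop

    return composed


s1 = swap_labels(1, "a", "b")
s2 = swap_labels(2, "c", "d")
same_map = compose([s1, s2])(POP) == compose([s2, s1])(POP)   # commuting swaps
back_to_start = compose([s1, s1])(POP) == POP                  # involution
```

In this toy setting the sequences `[s1, s2]` and `[s2, s1]` are different elements of the sequence space but act identically, which is the sense in which $T$ fails to be one-to-one.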

One more remark is in order here. 

Remark 20. Notice that any concatenation of sequences in F (which is what corresponds 
to the composition of the corresponding functions) stays in F. In other words, the family 
of maps induced by F is closed under composition. 

Footnote (on the pushforward measure in remark 18): The sigma-algebra on $\mathcal{P}^{\mathcal{P}}$ is the one generated by $T$ with respect to the sigma-algebra that is originally chosen on $\mathcal{F}$; however, in practical applications the sets involved are finite and so all the sigma-algebras can be safely assumed to be power sets.

Of course, running the Markov process induced by the transition matrix in definition 16 infinitely long is impossible, but fortunately one does not have to do it. The central idea of the current paper is that the limiting outcome as time goes to infinity can be predicted exactly using the Geiringer-like theory and the desired evaluations of moves can be well-estimated at rather little computational cost in most cases. As pointed out in example 19 above, each of the transformations $\Theta_{i,a,b}$ is an involution and, in particular, is bijective. Therefore, every composition of these transformations is a bijection as well. We deduce, thereby, that the family $\mathcal{F}$ consists of bijections only (see remark 18). The finite population Geiringer theorem (see [13]) now applies and tells us the following:

Definition 21. Given populations $P$ and $Q$ of rollouts at a specified state in question as in definition 6 (see also remark 5), we say that $P \sim Q$ if there is a transformation $F \in \mathcal{F}$ such that $Q = F(P)$.

Theorem 22 (The Geiringer Theorem for POMDPs). The relation $\sim$ introduced in definition 21 is an equivalence relation. Given a population $P$ of rollouts at a specified state in question, the restriction of the Markov transition matrix introduced in definition 16 to the equivalence class $[P]$ of the population $P$ under $\sim$ is a well-defined Markov transition matrix which induces an irreducible and aperiodic Markov chain on $[P]$, and the unique stationary distribution of this Markov chain is the uniform distribution on $[P]$.
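The mechanism behind theorem 22 can be illustrated numerically: when every transformation is a bijection, the columns of the transition matrix sum to 1 as well as the rows, so the uniform vector is stationary. The sketch below is a toy illustration only. The state space (the six orderings of three symbols), the generators (two adjacent swaps plus the identity) and the weights in `MU` are all assumptions chosen for brevity; they merely play the roles of $[P]$, the family $\mathcal{S}$ and the measure $\mu$.

```python
from fractions import Fraction
from itertools import permutations

STATES = list(permutations("abc"))            # toy stand-in for an equivalence class [P]
INDEX = {s: k for k, s in enumerate(STATES)}


def swap(pos):
    """Bijection on STATES exchanging the entries at positions pos and pos + 1."""
    def f(s):
        t = list(s)
        t[pos], t[pos + 1] = t[pos + 1], t[pos]
        return tuple(t)
    return f


def identity(s):
    return s


# Assumed measure: positive weight on the identity and on each generator.
MU = [(identity, Fraction(1, 2)), (swap(0), Fraction(1, 3)), (swap(1), Fraction(1, 6))]


def transition_matrix(mu):
    """p[x][y] = total weight of the transformations that send state x to state y."""
    n = len(STATES)
    p = [[Fraction(0)] * n for _ in range(n)]
    for f, weight in mu:
        for s in STATES:
            p[INDEX[s]][INDEX[f(s)]] += weight
    return p


def step(dist, p):
    """One step of the chain: row vector times transition matrix."""
    n = len(dist)
    return [sum(dist[x] * p[x][y] for x in range(n)) for y in range(n)]


P_MATRIX = transition_matrix(MU)
UNIFORM = [Fraction(1, len(STATES))] * len(STATES)
```

Starting from a point mass and applying `step` repeatedly drives the distribution toward `UNIFORM`; the positive weight on the identity is what rules out periodicity in this toy chain.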

In fact, thanks to the application of the classical contraction mapping principle described in section 6 of the current paper (namely theorem [STl; the interested reader is welcome to familiarize themselves with section 6, although this is not essential to understand the main objective of the paper), the stationary distribution is uniform in a rather strong sense described below.

Theorem 23. Suppose we are given finitely many probability measures $\mu_1, \mu_2, \ldots, \mu_N$ on the collection of sequences of transformations $\mathcal{F}$ as in definition 16, where each probability measure $\mu_i$ satisfies the conditions of definition 16. Denote by $M_i$ the corresponding Markov transition matrix induced by the probability measure $\mu_i$. Let $\mathcal{M} = \{M_i\}_{i=1}^{N}$. Now consider the following stochastic process $\{(\Phi_n, X_n)\}_{n=0}^{\infty}$ with state space $\mathcal{M} \times [P]$, where $[P]$ is the equivalence class of the initial population of rollouts at the state in question as in theorem 22 and $\Phi_n$ is an arbitrary stochastic process (not necessarily Markovian) on $\mathcal{M}$ which satisfies the following requirement:

The random variable $\Phi_n$ is independent of the random variables $X_n, X_{n+1}, \ldots$ (1)

The random variable $\Phi_0$ is arbitrary while $X_0 = P$ (recall that $P$ is the initial population of rollouts at the node in question) with probability 1.

$\forall n \in \mathbb{N}$ the probability distribution of the random variable $X_n$ is given by

$$\mathrm{Prob}(X_n = \cdot) = \Phi_{n-1}(\omega) \cdot \mathrm{Prob}(X_{n-1} = \cdot). \quad (2)$$

It follows then that $\lim_{n \to \infty} \mathrm{Prob}(X_n = \cdot) = \pi$ where $\pi$ is the uniform distribution on $[P]$.

We now pause and take some time to interpret theorem 23 intuitively. Example 24 below illustrates a scenario where theorem 23 applies.

Footnote (on the contraction mapping principle): This simple and elegant classical result about complete metric spaces lies at the heart of many important theorems such as the "existence uniqueness" theorem in the theory of differential equations, for instance.

Example 24. Consider the set $S$ of all finite sequences of populations in the equivalence class $[P]$ of the initial population $P$ which start with the initial population $P$ (notice that $S$ is a countably infinite set since $[P]$ is a finite set). Intuitively, each sequence in $S$ represents prior history. Every sequence $\vec{P} = P, P_1, P_2, P_3, \ldots, P_t$ is associated with a probability measure $\eta(\vec{P})$ on the set of populations $[P]$. Suppose further that to every population $Q \in [P]$ we assign a probability measure $\mu_Q$ on the family of recombination transformations induced by $\mathcal{F}$ where each measure $\mu_Q$ satisfies definition 16. Intuitively, each probability distribution $\mu_Q$ might represent the probability that the swaps (or sequences of swaps) are reasonable to perform in a specific population regardless of the knowledge of the prior history or experience in playing the game, for instance. Starting with the initial population $P$ we apply the probability measure $\eta(P)$ (here $P$ denotes a singleton sequence) to obtain a population $Q_1 \in [P]$. Independently we now apply the Markov transition matrix induced by the probability measure $\mu_{Q_1}$ to obtain another population $P_1 \in [P]$. Next, we select a population $Q_2$ with respect to the probability measure $\eta(P, P_1)$ and, again independently, apply the Markov transition matrix induced by $\mu_{Q_2}$ to the population $P_1$ to obtain a population $P_2$ in the next generation. Continuing recursively, let's say after time $t \in \mathbb{N}$ we obtained a population $Q_t$ at step $t$ and a sequence of populations $\vec{P}_t = P, P_1, P_2, \ldots, P_t$. Select a population $Q_{t+1}$ with respect to the probability measure $\eta(\vec{P}_t)$. Independently select a population $P_{t+1}$ via an application of the Markov transition matrix induced by the probability measure $\mu_{Q_{t+1}}$ to the population $P_t$. Theorem 23 applies now and tells us that in the limit as $t \to \infty$ we are equally likely to encounter any population $Q \in [P]$ regardless of the choice of the measures involved as long as the probability measures $\mu_Q$ satisfy definition 16. A word of caution is in order here: it is not in vain that we emphasize that selection is made "independently" here. Theorem 23 simply does not hold without this assumption.

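The content of theorem 23 can be probed numerically in a toy setting. The sketch below assumes a six-element state space (orderings of three symbols) and two hypothetical transition matrices, each built from the identity and two adjacent swaps with different positive weights. The matrix used at step $n$ is picked by a `schedule` function of the step index alone, which is the simplest kind of selection process that is independent of the current population.

```python
from itertools import permutations

STATES = list(permutations(range(3)))         # toy stand-in for [P]
INDEX = {s: k for k, s in enumerate(STATES)}


def swap(s, pos):
    """Exchange the entries of the tuple s at positions pos and pos + 1."""
    t = list(s)
    t[pos], t[pos + 1] = t[pos + 1], t[pos]
    return tuple(t)


def matrix(w_id, w0, w1):
    """Transition matrix for the measure putting weight w_id on the identity
    and weights w0, w1 on the two adjacent swaps (weights are assumptions)."""
    n = len(STATES)
    p = [[0.0] * n for _ in range(n)]
    for s in STATES:
        p[INDEX[s]][INDEX[s]] += w_id
        p[INDEX[s]][INDEX[swap(s, 0)]] += w0
        p[INDEX[s]][INDEX[swap(s, 1)]] += w1
    return p


M = [matrix(0.5, 0.3, 0.2), matrix(0.2, 0.1, 0.7)]


def evolve(schedule, steps):
    """Distribution of X_steps when matrix M[schedule(n)] is applied at step n,
    starting from a point mass on the first state."""
    dist = [1.0] + [0.0] * (len(STATES) - 1)
    for n in range(steps):
        p = M[schedule(n)]
        dist = [sum(dist[x] * p[x][y] for x in range(len(dist)))
                for y in range(len(dist))]
    return dist
```

Whatever schedule is supplied (alternating, always the same matrix, or any other pattern in `n`), the resulting distribution approaches the uniform one in this toy chain. A schedule that inspected the current state would fall outside the hypotheses, in line with the word of caution at the end of example 24.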
Evidently example 24 represents just one of numerous possible interpretations of theorem 23. We hope that other authors will elaborate on this point. Knowing that the limiting frequency of occurrence of any two given populations $Q_1$ and $Q_2 \in [P]$ is the same, it is possible to compute the limiting frequency of occurrence of any specific rollout and even certain subsets of rollouts using the machinery developed in [13] and [12], which is also presented in section ?? of the current paper for the sake of self-containment. To state
and derive these "Geiringer-like" results we need to introduce the appropriate notions of 
schemata (see, for instance, fj) and ifTTl ) here. 

4.1. Schemata for MCT Algorithm 

Definition 25. Given a state (s,a) in question (see definition O, a rollout Holland-Poli 
schema is a sequence $h = \{x_i\}_{i=1}^{k}$, for some $k \in \mathbb{N}$, whose entries are actions under evaluation, natural numbers, the symbol $\#$, or terminal labels from $\Sigma$, such that for $k > 1$ we have: $x_1$ is an action under evaluation; $x_i \in \mathbb{N}$, when $1 < i < k$, represents an equivalence class of states; and $x_k \in \{\#\} \cup \Sigma$ could represent either a terminal label, if it is a member of the set of terminal labels $\Sigma$, or any substring defining a valid rollout, if it is a $\#$ sign. For $k = 1$ there is a unique schema of the form $\#$. Every schema uniquely determines a set of rollouts

$$S_h = \begin{cases} \{(x_1, (x_2, a_2), (x_3, a_3), \ldots, (x_{k-1}, a_{k-1}), x_k) \mid a_i \in A \text{ for } 1 < i < k\} & \text{if } k > 1 \text{ and } x_k \in \Sigma \\ \{(x_1, (x_2, a_2), (x_3, a_3), \ldots, (x_{k-1}, a_{k-1}), (y_k, a_k), (y_{k+1}, a_{k+1}), \ldots, f) \mid a_i \in A \text{ for } 1 < i < k,\ y_j \in \mathbb{N} \text{ and } a_j \in A\} & \text{if } k > 1 \text{ and } x_k = \# \\ \text{the entire set of all possible rollouts} & \text{if } k = 1 \text{ or, equivalently, } h = \# \end{cases}$$

which fit the schema in the sense mentioned above. We will often abuse the language and use the same word schema to mean either the schema $h$ as a formal sequence as above or the schema as a set of rollouts which fit the schema. For example, if $h$ and $h^*$ are schemata, we will write $h \cap h^*$ as a shorthand notation for $S_h \cap S_{h^*}$ where $\cap$ denotes the usual intersection of sets. Just as in definition 3, we will say that $k - 1$, the number of states in the schema $h$, is the height of the schema $h$.

Footnote (on the term "Holland-Poli schema"): This notion of a schema is somewhat of a mixture between Holland's and Poli's notions.

We illustrate the important notion of a schema with an example below: 

Example 26. Suppose we are given a schema $h = (\alpha, 1, 2, \#)$. Then the rollouts $(\alpha, 1a, 2c, 5a, 3c, f)$ and $(\alpha, 1d, 2a, 3a, 3d, g) \in S_h$ or one could say that both of them fit the schema $h$. On the other hand the rollout $(\beta, 1a, 2c, 5a, 3c, f) \notin S_h$ (or does not fit the schema $h$) unless $\alpha = \beta$. A rollout $(\alpha, 1a, 3a, 5a, 3c, f) \notin S_h$ does not fit the schema $h$ either since $x_3 = 2 \neq 3$. None of the rollouts above fits the schema $h^* = (\alpha, 1, 2, f)$ since the appropriate terminal label is not reached in the 4th position. An instance of a rollout which fits the schema $h^*$ would be $(\alpha, 1c, 2b, f)$.
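The matching rule of definition 25 is easy to state as code. The following is a minimal sketch under an assumed encoding that is not part of the paper: a rollout is a tuple `(action, (class, label), ..., terminal)` and a schema is a tuple `(action, class, ..., last)` with `last` either the string `"#"` or a terminal label. It also assumes that `#` matches any valid continuation, including one that terminates immediately.

```python
def fits(schema, rollout):
    """Return True if `rollout` fits the Holland-Poli `schema` (assumed encoding).

    schema : (action, class_1, ..., class_n, last), last == "#" or a terminal label;
             the one-element tuple ("#",) matches every rollout.
    rollout: (action, (class, label), ..., (class, label), terminal).
    """
    if schema == ("#",):
        return True
    action, classes, last = schema[0], schema[1:-1], schema[-1]
    states, terminal = rollout[1:-1], rollout[-1]
    if rollout[0] != action or len(states) < len(classes):
        return False
    # Only the equivalence classes must agree; the labels from A are free.
    if any(state[0] != cls for state, cls in zip(states, classes)):
        return False
    if last == "#":
        return True                      # any valid continuation is allowed
    return len(states) == len(classes) and terminal == last


# The rollouts and schemata of example 26.
H = ("alpha", 1, 2, "#")
H_STAR = ("alpha", 1, 2, "f")
R1 = ("alpha", (1, "a"), (2, "c"), (5, "a"), (3, "c"), "f")
R2 = ("alpha", (1, "d"), (2, "a"), (3, "a"), (3, "d"), "g")
R3 = ("beta", (1, "a"), (2, "c"), (5, "a"), (3, "c"), "f")
R4 = ("alpha", (1, "a"), (3, "a"), (5, "a"), (3, "c"), "f")
R5 = ("alpha", (1, "c"), (2, "b"), "f")
```

With this encoding `fits(H, R1)` and `fits(H, R2)` hold, `R3` and `R4` are rejected by `H`, and among the five rollouts only `R5` fits `H_STAR`.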

The notion of schema is useful for stating and proving Geiringer-like results largely thanks 
to the following notion of partial order. 

Definition 27. Given schemata $h$ and $g$ we will write $h > g$ if either $h = \#$ and $g \neq \#$, or $h = (x_1, x_2, x_3, \ldots, x_{k-1}, \#)$ while $g = (x_1, x_2, x_3, \ldots, x_{k-1}, y_k, y_{k+1}, \ldots, y_{l-1}, y_l)$ where $y_l$ could be either of the allowable values: a $\#$ or a terminal label $f \in \Sigma$. However, if $y_l = \#$ then we require that $l > k$.

An obvious fact following immediately from definitions 25 and 27 is the following.

Proposition 28. Suppose we are given schemata $h$ and $g$. Then $h > g \Longrightarrow S_h \supseteq S_g$.

4.2. The Statement of Geiringer-like Theorems for the POMDPs 

In evolutionary computation Geiringer-like results address the limiting frequency of oc- 
currence of a set of individuals fitting a certain schema (see ifTSll . ifTSll and lfT2l ). In this 
work our theory rests on the finite population model based on the stationary distribution of the Markov chain of all populations potentially encountered in the process (see theorems 22 and 23 and example 24). The "limiting frequency of occurrence" (a rigorous definition appears in section 5, subsection 5.2, definitions 42 and 45; however, for the readers who aim only at a "calculus-level" understanding with the goal of applying the main ideas directly in their software engineering work, we will discuss the intuitive idea in more detail below) of a certain subset of individuals determined by a Holland-Poli schema $h$ among all the populations in the equivalence class $[P]$ of the initial population of rollouts $P$ as time increases (i.e. as $t \to \infty$) will be expressed solely in terms of the initial population $P$ and the schema $h$. These quantities are defined below.

Definition 29. For any action under evaluation $\alpha$ define a set-valued function $\alpha\downarrow$ from the set $\Omega^b$ of populations of rollouts to the power set of the set of natural numbers $\mathcal{P}(\mathbb{N})$ as follows: $\alpha\downarrow(P) = \{i \mid i \in \mathbb{N}$ and at least one of the rollouts in the population $P$ fits the Holland schema $(\alpha, i, \#)\}$. Likewise, for an equivalence class label $i \in \mathbb{N}$ define a set-valued function on the populations of size $b$ as $i\downarrow(P) = \{j \mid \exists\, x$ and $y \in A$ and a rollout $r$ in the population $P$ such that $r = (\ldots, (i, x), (j, y), \ldots)\} \cup \{f \mid f \in \Sigma$ and $\exists$ an $x \in A$ and a rollout $r$ in the population $P$ such that $r = (\ldots, (i, x), f)\}$. In words, the set $i\downarrow(P)$ is the set of all equivalence classes together with the terminal labels which appear after the equivalence class $i$ in at least one of the rollouts from the population $P$. Finally, introduce one more function, namely $i\downarrow_\Sigma : \Omega^b \to \mathbb{N} \cup \{0\}$, by letting $i\downarrow_\Sigma(P) = |\{f \mid f \in \Sigma \cap i\downarrow(P)\}|$, that is, the total number of terminal labels (which are assumed to be all formally distinct for convenience) following the equivalence class $i$ in a rollout of the population $P$.

As always, we illustrate definition 29 in example 30 below.

Example 30. Continuing with example 8, we return to the population $P$ in figure ??. From the picture we see that the only equivalence class $i$ such that a rollout from the population $P$ fits the Holland schema $(\alpha, i, \#)$ is $i = 1$ so that $\alpha\downarrow(P) = \{1\}$. Likewise, the only equivalence class following the action $\beta$ is 2, the only equivalence class following the action $\gamma$ is 4 and the only one following $\pi$ is 3 so that $\beta\downarrow(P) = \{2\}$, $\gamma\downarrow(P) = \{4\}$ and $\pi\downarrow(P) = \{3\}$. The only equivalence classes $i$ following the action $\xi$ in the population $P$ are $i = 3$ and $i = 2$ so that the set $\xi\downarrow(P) = \{2, 3\}$.

Likewise the fragment $(1, a), (5, a)$ appears in the first (leftmost) rollout in $P$, $(1, b), (3, c)$ in the second rollout, $(1, c), (4, b)$ in the fourth rollout and $(1, d), (2, e)$ in the last, seventh rollout. No other equivalence class or terminal label follows the equivalence class of the state 1 in the population $P$ and so it follows that $1\downarrow(P) = \{5, 3, 4, 2\}$ and $1\downarrow_\Sigma(P) = |\emptyset| = 0$. Likewise, equivalence class 1 follows the equivalence class 2 in the second rollout, 7 follows 2 in the fourth rollout, 4 follows 2 in the fifth rollout and 6 follows 2 in the last, seventh rollout. The only terminal label that follows the equivalence class 2 is $f_6$ in the 6th rollout. Thus we have $2\downarrow(P) = \{1, 7, 4, 6, f_6\}$ and $2\downarrow_\Sigma(P) = |\{f_6\}| = 1$. We leave the reader to verify that

$3\downarrow(P) = \{7, 6, 2, 1\}$ so that $3\downarrow_\Sigma(P) = 0$,

$4\downarrow(P) = \{6, 2, f_5\}$ so that $4\downarrow_\Sigma(P) = 1$,

$5\downarrow(P) = \{6, f_3, f_4\}$ and so $5\downarrow_\Sigma(P) = 2$,



October 24, 2011 2:34 Emerald/INSTRUCTION FILE InvitedSubmittedFirst- 
DraftForArchive 



Geiringer Theorem. Partially Observable Markov decision Processes and other Monte-Carlo search Methods. 17 

$6\downarrow(P) = \{3, 5, f_2, f_7\}$ and so $6\downarrow_\Sigma(P) = 2$
and, finally, $7\downarrow(P) = \{5, f_1\}$ so that $7\downarrow_\Sigma(P) = 1$.
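The sets of definition 29 can be computed mechanically. The sketch below is an illustration under assumptions: it encodes a rollout as `(action, [(class, label), ...], terminal)` with terminal labels as strings, and it uses the seven rollouts of the population pictured for $\nu_{6,a,b}(Q)$ earlier in this section, as read off that figure. That population lies in the same equivalence class as $P$, so by remark 34 the values should agree with those listed in example 30.

```python
from collections import defaultdict

# Rollouts of the pictured population nu_{6,a,b}(Q), transcribed by hand (assumption).
POPULATION = [
    ("alpha", [(1, "b"), (3, "c"), (6, "d")], "f2"),
    ("beta", [(2, "b"), (7, "b"), (5, "c")], "f4"),
    ("gamma", [(4, "a"), (6, "a"), (5, "a"), (6, "b"), (3, "d"), (7, "a")], "f1"),
    ("alpha", [(1, "d"), (2, "c"), (4, "c")], "f5"),
    ("xi", [(3, "a"), (2, "e"), (6, "c")], "f7"),
    ("xi", [(2, "d")], "f6"),
    ("pi", [(3, "b"), (1, "c"), (4, "b"), (2, "a"), (1, "a"), (5, "b")], "f3"),
]


def successors(population):
    """Return (after_action, after_class): for each action the set of classes that
    follow it, and for each class i the set of classes and terminal labels that
    follow i. Every rollout is assumed to contain at least one state."""
    after_action = defaultdict(set)
    after_class = defaultdict(set)
    for action, states, terminal in population:
        chain = [cls for cls, _ in states] + [terminal]
        after_action[action].add(chain[0])
        for here, nxt in zip(chain, chain[1:]):
            after_class[here].add(nxt)
    return after_action, after_class


def terminal_count(after_class, i):
    """Number of terminal labels (encoded as strings) that follow class i."""
    return sum(isinstance(x, str) for x in after_class[i])


AFTER_ACTION, AFTER_CLASS = successors(POPULATION)
TERMINAL_COUNTS = [terminal_count(AFTER_CLASS, i) for i in range(1, 8)]
```

Because terminal labels are assumed distinct and every rollout ends in exactly one of them, `sum(TERMINAL_COUNTS)` equals the number of rollouts.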

Remark 31. Note that, by assumption, all the terminal labels within the same population are distinct (see definition 6 together with the comment in the footnote there). But then, since every rollout ends with a terminal label, we must have $\sum_{i=1}^{\infty} i\downarrow_\Sigma(P) = b$ (of course, only finitely many summands, namely those equivalence classes that appear in the population $P$, may contribute nonzero values to $\sum_{i=1}^{\infty} i\downarrow_\Sigma(P)$) where $b$ is the number of rollouts in the population $P$, i.e. the size of the population $P$. For instance, in example 30 $b = 7$ and there are in total 7 equivalence classes, namely 1, 2, 3, 4, 5, 6 and 7, that occur within the population in figure ?? so that we have $\sum_{i=1}^{\infty} i\downarrow_\Sigma(P) = \sum_{i=1}^{7} i\downarrow_\Sigma(P) = 0 + 1 + 0 + 1 + 2 + 2 + 1 = 7 = b$.

Another important and related definition we need to introduce is the following: 

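Definition 32. Given a population $P$ and integers $i$ and $j \in \mathbb{N}$ representing equivalence classes, let

$$\mathrm{Order}(i \downarrow j, P) = \begin{cases} 0 & \text{if } i\downarrow(P) = \emptyset \text{ or } j \notin i\downarrow(P) \\ |\{((i, a), (j, b)) \mid \text{the segment } ((i, a), (j, b)) \text{ appears in one of the rollouts in the population } P\}| & \text{otherwise.} \end{cases}$$

Loosely speaking, $\mathrm{Order}(i \downarrow j, P)$ is the total number of times the equivalence class $j$ follows the equivalence class $i$ within the population of rollouts $P$.

Likewise, given a population of rollouts $P$, an action $\alpha$ under evaluation and an integer $j \in \mathbb{N}$, let

$$\mathrm{Order}(\alpha \downarrow j, P) = \begin{cases} 0 & \text{if } \alpha\downarrow(P) = \emptyset \text{ or } j \notin \alpha\downarrow(P) \\ |\{(\alpha, (j, b)) \mid \text{the segment } (\alpha, (j, b)) \text{ appears in one of the rollouts in the population } P\}| & \text{otherwise.} \end{cases}$$

Alternatively, $\mathrm{Order}(\alpha \downarrow j, P)$ is the number of rollouts in the population $P$ fitting the rollout Holland schema $(\alpha, j, \#)$.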

We now provide an example to illustrate definition 32.

Example 33. Continuing with example 30 and the population $P$ appearing in figure ??, we recall that $\alpha\downarrow(P) = \{1\}$. We immediately deduce that $\mathrm{Order}(\alpha \downarrow j, P) = 0$ unless $j = 1$. There are two rollouts, namely the first and the fourth, that fit the schema $(\alpha, 1, \#)$ so that $\mathrm{Order}(\alpha \downarrow 1, P) = 2$. Likewise, $\beta\downarrow(P) = \{2\}$ and exactly one rollout, namely the second one, fits the Holland schema $(\beta, 2, \#)$ so that $\mathrm{Order}(\beta \downarrow j, P) = 0$ unless $j = 2$ while $\mathrm{Order}(\beta \downarrow 2, P) = 1$. Continuing in this manner (the reader may want to look back at example 30), we list all the nonzero values of the function $\mathrm{Order}(\text{action} \downarrow \square, P)$ for the population $P$ in figure ??: $\mathrm{Order}(\gamma \downarrow 4, P) = \mathrm{Order}(\xi \downarrow 3, P) = \mathrm{Order}(\xi \downarrow 2, P) = \mathrm{Order}(\pi \downarrow 3, P) = 1$.

Likewise, recall from example 30 that $1\downarrow(P) = \{5, 3, 4, 2\}$ so that $\mathrm{Order}(1 \downarrow j, P) = 0$ unless $j = 5$ or $j = 3$ or $j = 4$ or $j = 2$. It so happens that a unique rollout exists in the population $P$ fitting each fragment $(1, (j, \text{something in } A))$ for $j = 5$, $j = 3$, $j = 4$ and $j = 2$ respectively, namely the first, the second, the fourth and the last (seventh) rollouts. According to definition 32 we then have $\mathrm{Order}(1 \downarrow 5, P) = \mathrm{Order}(1 \downarrow 3, P) = \mathrm{Order}(1 \downarrow 4, P) = \mathrm{Order}(1 \downarrow 2, P) = 1$. Analogously, $2\downarrow(P) = \{1, 7, 4, 6, f_6\}$ so that $\mathrm{Order}(2 \downarrow j, P) = 0$ unless $j = 1, 7, 4$ or $6$. The only rollout in the population $P$ involving the fragment with 1 following 2 is the second one, the only one involving 7 following 2 is the fourth, the only one involving 4 following 2 is the fifth, and the only one involving 6 following 2 is the last (the seventh) rollout, so that $\mathrm{Order}(2 \downarrow 1, P) = \mathrm{Order}(2 \downarrow 7, P) = \mathrm{Order}(2 \downarrow 4, P) = \mathrm{Order}(2 \downarrow 6, P) = 1$. Continuing in this manner, we list all the remaining nonzero values of the "Order" function introduced in definition 32 for the population $P$ in figure ??:

$\mathrm{Order}(3 \downarrow 7, P) = \mathrm{Order}(3 \downarrow 6, P) = \mathrm{Order}(3 \downarrow 2, P) = \mathrm{Order}(3 \downarrow 1, P) = 1$,
$\mathrm{Order}(4 \downarrow 6, P) = \mathrm{Order}(4 \downarrow 2, P) = 1$,

$\mathrm{Order}(5 \downarrow 6, P) = \mathrm{Order}(6 \downarrow 3, P) = \mathrm{Order}(6 \downarrow 5, P) = \mathrm{Order}(7 \downarrow 5, P) = 1$.
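The "Order" function of definition 32 amounts to counting adjacent pairs. The sketch below assumes the encoding `(action, [(class, label), ...], terminal)` for rollouts and counts occurrences of each pair, which agrees with the set cardinality in the definition as long as labelled states are not repeated within the population. The data are the seven rollouts of the pictured population $\nu_{6,a,b}(Q)$, transcribed by hand; by remark 34 its counts should match those listed for $P$ in example 33.

```python
from collections import Counter

# Hand transcription of the pictured population nu_{6,a,b}(Q) (assumption).
POPULATION = [
    ("alpha", [(1, "b"), (3, "c"), (6, "d")], "f2"),
    ("beta", [(2, "b"), (7, "b"), (5, "c")], "f4"),
    ("gamma", [(4, "a"), (6, "a"), (5, "a"), (6, "b"), (3, "d"), (7, "a")], "f1"),
    ("alpha", [(1, "d"), (2, "c"), (4, "c")], "f5"),
    ("xi", [(3, "a"), (2, "e"), (6, "c")], "f7"),
    ("xi", [(2, "d")], "f6"),
    ("pi", [(3, "b"), (1, "c"), (4, "b"), (2, "a"), (1, "a"), (5, "b")], "f3"),
]


def order_counts(population):
    """Counter mapping (action, j) and (i, j) to the number of times class j
    directly follows the action, respectively class i. Missing keys read as 0,
    which covers the first branch of definition 32."""
    counts = Counter()
    for action, states, _terminal in population:
        classes = [cls for cls, _ in states]
        counts[(action, classes[0])] += 1
        for i, j in zip(classes, classes[1:]):
            counts[(i, j)] += 1
    return counts


ORDER = order_counts(POPULATION)
```

For instance `ORDER[("alpha", 1)]` is 2 while every class-to-class pair that occurs at all occurs exactly once in this population.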

Remark 34. It must be noted that all the functions introduced in definitions 29 and 32 remain invariant if one were to apply the "primitive" recombination transformations from the family $\mathcal{S}$ as in definitions 16 and 12 to the population in the argument. More explicitly, given any population of rollouts $P$, an action $\alpha$ under evaluation, an equivalence class $i \in \mathbb{N}$, a Holland-Poli schema $h = (\alpha, i_1, i_2, \ldots, i_{k-1}, x_k)$, an integer $j$ with $1 < j < k$, and any recombination transformation $F \in \mathcal{S}$, we have

$\alpha\downarrow(P) = \alpha\downarrow(F(P))$, $i\downarrow(P) = i\downarrow(F(P))$,

$i\downarrow_\Sigma(P) = i\downarrow_\Sigma(F(P))$, $\mathrm{Order}(q \downarrow r, P) = \mathrm{Order}(q \downarrow r, F(P))$.

Indeed, the reader may easily verify that performing a swap of the elements of the same equivalence class, or of the corresponding subtrees pruned at equivalent labels, preserves all the states which are present within the population and creates no new ones. Moreover, the equivalence class sequel is also preserved and hence the invariance of the functions $\alpha\downarrow$ and $i\downarrow$ etc. follows. Since every transformation in the family $\mathcal{F}$ is a composition of the crossover transformations from the family $\mathcal{S}$, it follows at once that all of the functions introduced in definitions 29 and 32 are constant on the equivalence classes of populations under the equivalence relation introduced in definition 21.

Example 35. Recall from example 14 that the populations in figures ??, ?? and ?? are equivalent and, likewise, according to example 15 the populations in figures ?? and ?? are equivalent. Moreover, example 19 demonstrates that the populations displayed in figures ?? and ?? are also equivalent. Thus all of the populations that appear in figures ??, ??, ??, ?? and ?? belong to the same equivalence class under the relation $\sim$ introduced in definition 21. In view of remark 34 all the functions appearing in definitions 29 and 32 produce identical values on the populations displayed on figures ??, ??, ??, ?? and ??.

Observe that applying any recombination transformation of the form $\chi_{i,a,b}$ or $\nu_{i,a,b}$ to a population $P$ of rollouts neither removes any states from the population nor adds any new ones, and hence the following invariance property of the equivalent populations, which will largely alleviate the theoretical analysis in section 5, follows.

Remark 36. Given any population $Q \in [P]$, the total number of states in the population $Q$ is the same as that in the population $P$. As we already mentioned, the total number of states in a population is the sum of the heights of all rollouts in that population (see definitions 3 and 6). It follows then that the sum of the heights of all rollouts within a population is an invariant quantity under the equivalence relation in definition 21. In other words, if $Q \sim P$ then the sum of the heights of the rollouts in the population $Q$ is the same as the sum of the heights of the rollouts in the population $P$.

There is yet one more important notion, namely that of the "limiting frequency of occurrence" of a schema as one runs the genetic programming routine with recombination only, that we need to introduce to state the Geiringer-like results of the current paper. A rigorous definition in the most general framework appears in subsection 5.2 of section 5 (namely, definitions 42 and 45); nonetheless, for less patient readers, who aim only at a "calculus level" understanding, we explain informally what the limiting frequency of occurrence is.

Informal Description of the Limiting Frequency of Occurrence: Given a schema $h$ and a population $P$ of size $m$, suppose we run the Markov process $\{X_n\}_{n=0}^{\infty}$ on the populations in the equivalence class $[P]$ of the initial population of rollouts $P$ as in definition 16 or, more generally, the time-inhomogeneous Markov process as described in theorem 23 (where the Markov transition matrices introduced in definition 16 are chosen randomly with respect to another stochastic process (not necessarily Markovian) that does not depend on the current population but may depend on the entire history of former populations as well as on other external parameters independent of the current population). As discussed in the preceding paragraph, this corresponds to "running the genetic programming routine forever" and each recombination models the changes in the player's strategies due to incomplete information, randomness, personality etc. Up to time $t$ a total of $m \cdot t$ individuals (counting repetitions) have been encountered. Among these a certain number, say $h(t)$, fit the schema $h$ in the sense of definition 25. We now let $\Phi(P, h, t) = \frac{h(t)}{m \cdot t}$ be the proportion of these individuals fitting the schema $h$ out of the total number of individuals encountered up to time $t$. It follows from theorem 22 via the instruments presented in subsection 5.2 (also available in [13] and [?]) that $\lim_{t \to \infty} \Phi(P, h, t)$ exists and the formula for it will be given purely in terms of the parameters of the initial population $P$ (more specifically, in terms of the functions described in definitions 29 and 32). Although it may be possible to derive the formulas for $\lim_{t \to \infty} \Phi(P, h, t)$ in the most general case when the initial population of rollouts $P$ is non-homologous (in other words when the states representing the same equivalence class may appear at various "heights" in the same population of rollouts: see definition 6), the formulas obtained in this manner would definitely be significantly more cumbersome and would not be as well suited for algorithm development as the limiting result with respect to "inflating" the initial population $P$ in the sense described below. Remarkably, the formula for the limiting result in the general non-homologous initial population case coincides with the one for the homologous populations.
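The informal quantity $\Phi(P, h, t)$ is a running proportion, which the following sketch makes explicit. Both names are hypothetical: `history` stands for the list of populations visited at times $1, \ldots, t$ (each of the same size $m$), and `fits_schema` for any predicate deciding whether an individual fits the schema $h$. How the populations are generated is left outside the sketch.

```python
def running_frequency(history, fits_schema):
    """Return [Phi(P, h, 1), ..., Phi(P, h, t)]: after each time step, the fraction
    of all individuals encountered so far (counting repetitions) that fit the schema."""
    seen = 0       # m * t: individuals encountered up to the current time
    hits = 0       # h(t): how many of them fit the schema
    values = []
    for population in history:
        seen += len(population)
        hits += sum(1 for individual in population if fits_schema(individual))
        values.append(hits / seen)
    return values


# Tiny usage example with strings standing in for rollouts.
EXAMPLE = running_frequency([["x", "y"], ["x", "x"]], lambda r: r == "x")
```

In the usage example one of the first two individuals fits, then three of the first four, so the running values are 0.5 and 0.75.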

Definition 37. Given a population $P = \{r_l\}_{l=1}^{b}$ of rollouts in the sense of definition 6, where $r_l = (\alpha^l, (i_1^l, a_1^l), (i_2^l, a_2^l), \ldots, (i_{n(l)-1}^l, a_{n(l)-1}^l), f_l)$, and a positive integer $m$, we first increase the size of the alphabet $A$ by a factor of $m$: formally, let the alphabet

$$A \times m = \{(a, i) \mid a \in A,\ i \in \mathbb{N} \text{ and } 1 \le i \le m\}.$$

Likewise, we also increase the terminal set of labels $\Sigma$ by a factor of $m$ so that

$$\Sigma \times m = \{(f, i) \mid f \in \Sigma,\ i \in \mathbb{N} \text{ and } 1 \le i \le m\}.$$

Now we let

$$P_m = \{r_{l,k}\}_{1 \le l \le b,\ 1 \le k \le m}$$

where

$$r_{l,k} = (\alpha^l, (i_1^l, (a_1^l, k)), (i_2^l, (a_2^l, k)), \ldots, (i_{n(l)-1}^l, (a_{n(l)-1}^l, k)), (f_l, k)).$$

We will say that the population $P_m$ is an inflation of the population $P$ by a factor of $m$.

Essentially, a population $P_m$ consists of $m$ formally distinct copies of each rollout in the population $P$. Intuitively speaking, the stochastic information captured in the sample of rollouts comprising the population $P_m$ (such as the frequency of obtaining a state in the equivalence class of $j$ after a state in the equivalence class of $i$) is the same as the one contained within the population $P$, emphasized by the factor of $m$. In fact, the following rather important obvious facts make some of this intuition precise:

Proposition 38. Given a population $P$ of rollouts and a positive integer $m$, consider the inflation $P_m$ of the population $P$ by a factor of $m$ as in definition 37. Then the following are true:

$\alpha\downarrow(P_m) = \alpha\downarrow(P)$, $i\downarrow(P_m) = i\downarrow(P)$, $i\downarrow_\Sigma(P_m) = m \cdot i\downarrow_\Sigma(P)$

while

$\mathrm{Order}(\alpha \downarrow j, P_m) = m \cdot \mathrm{Order}(\alpha \downarrow j, P)$, $\mathrm{Order}(q \downarrow r, P_m) = m \cdot \mathrm{Order}(q \downarrow r, P)$ (3)

For any population of rollouts $Q$ let $\mathrm{Total}(Q)$ denote the total number of states in the population $Q$ which is, of course, the same thing as the sum of the heights of all rollouts in the population $Q$. Then clearly $\mathrm{Total}(P_m) = m \cdot \mathrm{Total}(P)$. In the special case when $P$ is a homologous population, $\forall\, m \in \mathbb{N}$ so is the population $P_m$.
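The scaling identities of proposition 38 can be checked on a small case. The sketch below assumes the encoding `(action, [(class, label), ...], terminal)` and a hypothetical two-rollout population `POP`; `inflate` tags every label and terminal label with a copy index so that the $m$ copies are formally distinct.

```python
from collections import Counter

POP = [
    ("alpha", [(1, "a"), (2, "b")], "f1"),
    ("beta", [(2, "a"), (1, "b"), (2, "c")], "f2"),
]


def inflate(population, m):
    """Return m formally distinct copies of every rollout: each label a becomes
    (a, k) and each terminal label f becomes (f, k) for copy index k = 1..m."""
    return [
        (action, [(cls, (lab, k)) for cls, lab in states], (terminal, k))
        for action, states, terminal in population
        for k in range(1, m + 1)
    ]


def order_counts(population):
    """Count how often class j directly follows an action or a class i."""
    counts = Counter()
    for action, states, _terminal in population:
        classes = [cls for cls, _ in states]
        counts[(action, classes[0])] += 1
        for i, j in zip(classes, classes[1:]):
            counts[(i, j)] += 1
    return counts


def total_states(population):
    """Total(Q): the sum of the heights of all rollouts."""
    return sum(len(states) for _, states, _ in population)


INFLATED = inflate(POP, 3)
```

Here every entry of `order_counts(INFLATED)` is three times the matching entry of `order_counts(POP)`, and `total_states` scales by the same factor.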



Footnote: This is an open question, yet its practical importance is highly unclear.

When using Holland-Poli schemata with respect to any population $Q \in [P_m]$ we will adopt the following convention:

Given a Holland-Poli schema $h = (\alpha, i_1, i_2, \ldots, i_{k-1}, f)$ and a population $Q \in [P_m]$, an individual (i.e. a rollout) $r$ of the population $Q$ fits the schema $h$ if and only if it is of the form $r = (\alpha, (i_1, (a_1, j_1)), (i_2, (a_2, j_2)), \ldots, (i_{k-1}, (a_{k-1}, j_{k-1})), (f, j_k))$. Informally speaking, everything is as in definition 25 with the exception that the terminal symbol of the schema $h$ is $f \in \Sigma$ while the terminal symbol of the rollout $r$ is an ordered pair of the terminal symbol $f$ coupled with a numerical label between 1 and $m$, so that we require only the first element of the ordered pair, namely the function label $f$, to match.

We are finally ready to state the main result of the current paper. 

Theorem 40 (The Geiringer-Like Theorem for MCT). Repeat verbatim the assumptions of theorem 23. Let

$$h = (\alpha, i_1, i_2, \ldots, i_{k-1}, x_k)$$

where $x_k \in \{\#\} \cup \Sigma$ be a given Holland-Poli schema. For $m \in \mathbb{N}$ consider the random variable $\Phi(P_m, h, t)$ described in the paragraph just above (alternatively, a rigorous definition in the most general framework appears in subsection 5.2 of section 5, definitions 42 and 45) with respect to the Markov process $X_n^m$ where $m$ indicates that the initial population of rollouts is the inflated population $P_m$ as in definition 37 with the new alphabet $A \times m$ labeling the states (see also example 24 for help with understanding of the Markov process $X_n^m$). Then

$$\lim_{m \to \infty} \lim_{t \to \infty} \Phi(P_m, h, t) = \frac{\mathrm{Order}(\alpha \downarrow i_1, P)}{b} \times \left( \prod_{q=2}^{k-1} \frac{\mathrm{Order}(i_{q-1} \downarrow i_q, P)}{\sum_{j \in \mathbb{N}} \mathrm{Order}(i_{q-1} \downarrow j, P) + i_{q-1}\downarrow_\Sigma(P)} \right) \times \mathrm{LF}(P, h) \quad (4)$$

where

$$\mathrm{LF}(P, h) = \begin{cases} 1 & \text{if } x_k = \# \\ 0 & \text{if } x_k = f \in \Sigma \text{ and } f \notin i_{k-1}\downarrow(P) \\ \text{Fraction} & \text{if } x_k = f \in \Sigma \text{ and } f \in i_{k-1}\downarrow(P) \end{cases}$$

where

$$\text{Fraction} = \frac{1}{\sum_{j \in \mathbb{N}} \mathrm{Order}(i_{k-1} \downarrow j, P) + i_{k-1}\downarrow_\Sigma(P)}$$

(we write "LF" as short for "Last Factor"). Furthermore, in the special case when the initial population $P$ is homologous (see definition 6), one does not need to take the limit as $m \to \infty$ in the sense that $\lim_{t \to \infty} \Phi(P_m, h, t)$ is a constant independent of $m$ and its value is given by the right hand side of equation (4).

An important comment is in order here: it is possible that the denominator of one of the fractions involved in the product is 0. However, in such a case, the numerator is also 0 and we adopt the convention (in this theorem only) that if the numerator is 0 then, regardless of the value of the denominator (i.e. even if the denominator is 0), the fraction is 0. As a matter of fact, a denominator of some fraction involved is 0 if and only if one of the following holds: $\alpha\downarrow(P) = \emptyset$ or there exists an index $q$ with $1 \le q \le k - 1$ such that no state in the equivalence class of $i_q$ appears in the population $P$ (and hence in any of the inflated populations $P_m$).

Theorem 40 tells us that given any Holland-Poli rollout schema $h$ and a generating population $P$, $\forall\, \epsilon > 0$ $\exists$ a sufficiently large $M$ so that the right hand side of equation (4) provides an approximation of the limiting frequency of occurrence of the set of rollouts fitting the schema $h$ starting with the initial population $P_m$, which is the inflation of the population $P$ by a factor of $m > M$, namely $\lim_{t \to \infty} \Phi(P_m, h, t)$, with an error at most $\epsilon$.
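The right hand side of the theorem is cheap to evaluate from a single pass over the sample. The sketch below is one reading of that formula, stated here as an assumption: the first factor is $\mathrm{Order}(\alpha \downarrow i_1, P)/b$, each later factor is $\mathrm{Order}(i_{q-1} \downarrow i_q, P)$ divided by the total number of states of class $i_{q-1}$ in $P$, and the last factor is 1 for `"#"` or one over the number of states of class $i_{k-1}$ when the terminal label follows that class. The encoding of rollouts and the hand-transcribed population $\nu_{6,a,b}(Q)$ are likewise assumptions; the schema must name at least one equivalence class.

```python
from collections import Counter
from fractions import Fraction

POPULATION = [
    ("alpha", [(1, "b"), (3, "c"), (6, "d")], "f2"),
    ("beta", [(2, "b"), (7, "b"), (5, "c")], "f4"),
    ("gamma", [(4, "a"), (6, "a"), (5, "a"), (6, "b"), (3, "d"), (7, "a")], "f1"),
    ("alpha", [(1, "d"), (2, "c"), (4, "c")], "f5"),
    ("xi", [(3, "a"), (2, "e"), (6, "c")], "f7"),
    ("xi", [(2, "d")], "f6"),
    ("pi", [(3, "b"), (1, "c"), (4, "b"), (2, "a"), (1, "a"), (5, "b")], "f3"),
]


def limiting_frequency(population, schema):
    """Evaluate the assumed reading of the right hand side for
    schema = (action, i_1, ..., i_{k-1}, x_k) with x_k == "#" or a terminal label."""
    action, classes, last = schema[0], schema[1:-1], schema[-1]
    order = Counter()         # pair counts: (action, j) and (i, j)
    occurrences = Counter()   # states of class i = sum_j Order(i, j) + terminal count
    ends_after = {}           # terminal label -> class of the state preceding it
    for act, states, terminal in population:
        cls = [c for c, _ in states]
        order[(act, cls[0])] += 1
        for i, j in zip(cls, cls[1:]):
            order[(i, j)] += 1
        occurrences.update(cls)
        ends_after[terminal] = cls[-1]

    def ratio(num, den):
        # Convention of the theorem: a zero numerator gives 0 even if den is 0.
        return Fraction(0) if num == 0 else Fraction(num, den)

    value = ratio(order[(action, classes[0])], len(population))
    for prev, nxt in zip(classes, classes[1:]):
        value *= ratio(order[(prev, nxt)], occurrences[prev])
    if last != "#":
        hit = 1 if ends_after.get(last) == classes[-1] else 0
        value *= ratio(hit, occurrences[classes[-1]])
    return value
```

For the schema `("alpha", 1, 2, "#")` this gives $\frac{2}{7} \cdot \frac{1}{4} = \frac{1}{14}$: two of the seven rollouts start with $\alpha$ followed by class 1, and one of the four states of class 1 is followed by class 2.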

Theorem 40 is the main result of the current work. It motivates a variety of algorithms
for evaluating the actions based on the entire, fairly large and seemingly pairwise discon- 
nected sample of independent parallel rollouts that fully take advantage of the exponen- 
tially many possibilities already available within that sample and, at the same time, should 
be rather efficient in many situations. These algorithms will be the subject of sequel papers. 

5. Deriving Geiringer-like Theorems for POMDPs 

5.1. Setting, Notation and the General Finite-Population Geiringer Theorem 

Throughout section 5 (the current section) the following notation will be used: $\Omega$ is a finite set, called a search space. We fix an integer $b \in \mathbb{N}$ and we call $\Omega^b = \{(x_1, x_2, \ldots, x_b)^T \mid x_i \in \Omega\}$ the set of populations of size $b$; every element $x = (x_1, x_2, \ldots, x_b)^T \in \Omega^b$ is called a population of size $b$ and every element $x \in \Omega$ is called an individual. Notice that we prefer to think of a population as a "column vector" (hence the "transpose symbol"). Of course, this is just a matter of preference, but normally when we list the individuals it is natural to write each individual as a string of "genes or alleles" which appear on the same row and so the $b$ individuals appear on $b$ separate rows. It is important to emphasize here that populations are ordered $b$-tuples so that $(x_1, x_2, \ldots, x_b)^T \neq (x_b, x_2, \ldots, x_1)^T$ unless $x_1 = x_b$. By a family of recombination transformations we mean a family of functions $\mathcal{F} = \{F \mid F : \Omega^b \to \Omega^b\}$. The general finite population Geiringer theorem then says the following:

Theorem 41 (The Finite Population Geiringer Theorem for Evolutionary Algorithms)

Suppose we are given a probability measure $\mu$ on the family of recombination transformations $\mathcal{F}$ on the set of populations $\Omega^b$ of size $b$ as described above. Suppose further there is a subfamily $\mathcal{S} \subseteq \mathcal{F}$ which generates the entire family $\mathcal{F}$ in the sense that $\forall\, F \in \mathcal{F}$ $\exists$ a finite sequence of transformations $S_1, S_2, \ldots, S_l \in \mathcal{S}$ such that $F = S_1 \circ S_2 \circ \cdots \circ S_l$. Assume the following about the probability measure $\mu$:

\[ \forall\, S \in \mathcal{S} \text{ we have } \mu(S) > 0. \tag{5} \]

\[ \text{The identity map } \mathbf{1} : \Omega^b \to \Omega^b \text{ is in } \mathcal{S}. \tag{6} \]

Most importantly, assume that every recombination transformation $S \in \mathcal{S}$ is bijective (i.e. a one-to-one and onto function on $\Omega^b$). Consider the Markov transition matrix $M$ with state space $\Omega^b$ defined as follows: given populations $x$ and $y \in \Omega^b$, we let

\[ p_{x \to y} = \mu(\{F \mid F \in \mathcal{F} \text{ and } F(x) = y\}). \tag{7} \]

Now define a relation $\sim$ on $\Omega^b$ as follows: $x \sim y$ if and only if $\exists\, k \in \mathbb{N}$ and recombination transformations $F_1, F_2, \ldots, F_k \in \mathcal{F}$ such that $[F_1 \circ F_2 \circ \cdots \circ F_k](x) = y$. We now assert the following facts:

\[ \sim \text{ is an equivalence relation.} \tag{8} \]

Given an equivalence class of some population $x$, call it $[x]$, the restriction of the Markov transition matrix $M$ to $[x]$ is a well-defined Markov transition matrix on the state space $[x]$, call it $M|_{[x]}$. (9)

$\forall\, x \in \Omega^b$ the Markov transition matrix $M|_{[x]}$ is doubly stochastic and it defines an irreducible and aperiodic Markov chain on $[x]$. (10)

$\forall\, x \in \Omega^b$ the unique stationary distribution of $M|_{[x]}$ is the uniform distribution on $[x]$. (11)

Footnote (to the homologous special case of theorem 40): The case of homologous recombination has been established in a different but mathematically equivalent framework in [13] and [12]; nonetheless we will derive it along with the general fact expressed in equation 40 to illustrate the newly enhanced methodology based on the lumping quotients of Markov chains described in subsection 5.3.

Theorem 41 is a simple yet elegant consequence of basic group theory. In this paper we assume that the reader is familiar with fundamental notions about groups and group actions. Nearly any standard textbook in Abstract Algebra such as, for instance, [7] contains far more group theoretic material than necessary for our purpose. For a brief introduction we invite the reader to study [13].

Proof. Since the family of transformations $\mathcal{S}$ consists entirely of bijections and any composition of bijections is also a bijection, the family $\mathcal{F}$ also consists solely of bijections. It follows then that the family $\mathcal{F}$ generates a subgroup $G$ of the group of all permutations on the finite set $\Omega^b$. Notice that the probability measure $\mu$ naturally extends to the entire group $G$ generated by $\mathcal{F}$ by defining

\[
\mu_{\mathrm{ext}}(g) =
\begin{cases}
\mu(g) & \text{if } g \in \mathcal{F} \\
0 & \text{otherwise.}
\end{cases}
\]

Clearly the Markov process defined in the statement of theorem 41 (see equation 7) can be redefined as

\[ p_{x \to y} = \mu_{\mathrm{ext}}(\{g \mid g \in G \text{ and } g(x) = y\}). \tag{12} \]

Furthermore, notice that the group $G$ is of size no bigger than $|\Omega^b|! < \infty$ since $|\Omega^b| < \infty$. It follows then that every element $g \in G$ can be written as a finite composition $g = F_1 \circ F_2 \circ \cdots \circ F_k$ for $F_1, F_2, \ldots, F_k \in \mathcal{F}$ (because every element $F \in \mathcal{F} \subseteq G$ is a torsion element of $G$, i.e. $F^l = \mathbf{1}$ for some $l \in \mathbb{N}$ so that $F^{-1} = F^{l-1}$). But then the relation $\sim$ can be redefined as $x \sim y$ if and only if $\exists\, g \in G$ such that $g(x) = y$. We now quickly recognize that the relation $\sim$ is the orbit-defining equivalence relation which partitions the set of all populations of size $b$, $\Omega^b$, into the orbits under the action of the group $G$. The assertions expressed in equations 8 and 9 now follow at once. To verify equation 10 we choose any $y \in \Omega^b$ and compute directly

\[
\sum_{x \in \Omega^b} p_{x \to y} = \sum_{x \in \Omega^b} \mu_{\mathrm{ext}}(\{g \mid g \in G \text{ and } g(x) = y\}) = \sum_{x \in \Omega^b} \mu_{\mathrm{ext}}(\{g \mid g \in G \text{ and } g^{-1}(y) = x\}) = \mu_{\mathrm{ext}}(G) = 1
\]

since the sets $K(x) = \{g \mid g \in G \text{ and } g^{-1}(y) = x\}$ clearly form a partition of $G$. We have now shown that the Markov transition matrix $M$ is doubly stochastic. Irreducibility follows from finiteness together with the fact that $\mathcal{S}$ generates $\mathcal{F}$. Since $\mathbf{1} \in \mathcal{S}$, aperiodicity follows as well. Now the classical result about Markov chains tells us that there is a unique stationary distribution and since $M$ is doubly stochastic it must be the uniform distribution so that the final assertions expressed in equations 10 and 11 follow at once. □
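
The statement of theorem 41 can be checked by brute force on a very small instance. The following is a minimal sketch under stated assumptions: the search space (two-letter strings over {a, b}), the population size b = 2, the three transformations and their weights in `MU` are all hypothetical choices made for illustration and do not come from the paper. Exact rational arithmetic is used so that the equalities are tested exactly.

```python
from fractions import Fraction
from itertools import product

# Toy search space (an assumption for illustration): two-letter strings over {a, b}.
OMEGA = ["".join(letters) for letters in product("ab", repeat=2)]
# Populations are ordered pairs of individuals, i.e. b = 2.
POPULATIONS = list(product(OMEGA, repeat=2))


def identity(pop):
    return pop


def swap(pop):
    """Transposition of the two individuals."""
    return (pop[1], pop[0])


def crossover(pop):
    """One-point homologous crossover: the individuals exchange their second letters."""
    x, y = pop
    return (x[0] + y[1], y[0] + x[1])


# Generating subfamily S (all bijections) with a strictly positive weight on each member.
MU = {identity: Fraction(1, 2), swap: Fraction(1, 4), crossover: Fraction(1, 4)}


def transition_matrix():
    """p[x][y] = mu({F : F(x) = y}), as in equation 7."""
    p = {x: {y: Fraction(0) for y in POPULATIONS} for x in POPULATIONS}
    for x in POPULATIONS:
        for transform, weight in MU.items():
            p[x][transform(x)] += weight
    return p


def orbit(x):
    """Equivalence class [x]: all populations reachable from x by composing transformations."""
    seen, frontier = {x}, [x]
    while frontier:
        current = frontier.pop()
        for transform in MU:
            nxt = transform(current)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def is_doubly_stochastic(p):
    rows = all(sum(p[x].values()) == 1 for x in p)
    cols = all(sum(p[x][y] for x in p) == 1 for y in p)
    return rows and cols


def uniform_is_stationary(p, cls):
    """Check that the uniform distribution on cls is preserved by one step of the chain."""
    u = Fraction(1, len(cls))
    return all(sum(u * p[x][y] for x in cls) == u for y in cls)
```

In this toy instance the orbit of the population ("ab", "ba") consists of four populations, and the uniform distribution on every orbit is stationary, in agreement with assertions 10 and 11.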



5.2. A Methodology for the Derivation of Geiringer-like Results 

The classical Geiringer theorem (see [8]) from population genetics tells us something about the "limiting frequency of occurrence of certain individuals in a population" rather than referring to the limiting distribution of populations. In fact, the mathematical model of the classical Geiringer theorem in [8] is entirely different from that of the finite-population Geiringer theorem described in the previous section. Nonetheless, the finite-population Markov chain model is much more suited when dealing with evolutionary algorithms since all the structures, including the search space and populations, in the computational setting are finite and the model in [13] and [12] as well as in the current paper describes exactly what happens during a stochastic simulation. Knowing that some stochastic process $\{X_t\}_{t=0}^{\infty}$ on some equivalence class of populations $[x]$ tends to the uniform distribution over the populations (i.e. $\forall\, y \in [x]$ we have $\lim_{t \to \infty} P(X_t = y) = 1/|[x]|$) it is often possible to deduce what we call Geiringer-like theorems which express the limiting frequency of occurrence of specific individuals and specific sets of individuals in terms of the information contained in a single representative of the equivalence class only (say, the initial population). Of course, we need to formulate precisely what the "limiting frequency of occurrence" is.

Definition 42. Consider a function $\mathcal{X} : \mathcal{P}(\Omega) \times \Omega^b \to \{0, 1, 2, \ldots, b\}$ where $\mathcal{P}(\Omega)$ denotes the power set of $\Omega$ (i.e. the set of all subsets of $\Omega$) and $\Omega^b$ is the set of all populations of size $b$, as usual, defined as follows: given a subset $S \subseteq \Omega$ and a population $x = (x_1, x_2, \ldots, x_b)^T \in \Omega^b$, we define $\mathcal{X}(S, x) = |\{i \mid 1 \le i \le b,\ x_i \in S\}|$ to be the number of individuals in the population $x$ which belong to the subset $S$ (counting their multiplicities).

Example 43. Let's say $S = \{u\}$ is a singleton set, $b = 3$ and $x = (u, v, u)^T$ where $v \neq u$. Then $\mathcal{X}(S, x) = 2$ since $x_1 = x_3 = u \in S$ while $x_2 = v \notin S$.

Remark 44. Observe that if we fix a subset $S \subseteq \Omega$ and let the second argument in the function $\mathcal{X}$ vary, then we get a function of one variable $\mathcal{X}(S, \square) : \Omega^b \to \{0, 1, 2, \ldots, b\}$ defined naturally by plugging a population of size $b$ in place of the $\square$.

Definition 45. Choose a subset $S \subseteq \Omega$, an equivalence class $[x]$ of populations of size $b$ and let $\{X_t\}_{t=0}^{\infty}$ be any stochastic process on $[x]$ ($x$ could be an initial population, for instance). It makes sense now to define a random variable

\[ \Phi(S, x, t) = \frac{\sum_{s=0}^{t-1} \mathcal{X}(S, X_s)}{b \cdot t}. \]

Clearly the random variable $\Phi(S, x, t)$ counts the fraction of occurrence (or frequency of encountering) the individuals from the set $S$ before time $t$. In general $\lim_{t \to \infty} \Phi(S, x, t)$ does not exist. However, under "nice" circumstances described below everything works out rather well.
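
The counting function of definition 42 and the time-averaged frequency of definition 45 can be sketched in a few lines. The function names `count_in_subset` and `frequency` are hypothetical, and the sketch assumes that the frequency is normalized by the population size b times the number t of populations observed.

```python
from fractions import Fraction


def count_in_subset(S, population):
    """Number of individuals of the population lying in S, with multiplicities."""
    return sum(1 for individual in population if individual in S)


def frequency(S, trajectory):
    """Fraction of individuals from S among all individuals seen along the trajectory.

    trajectory is a non-empty list of populations of equal size b; the result is
    the sum of count_in_subset over the trajectory divided by b * len(trajectory).
    """
    b = len(trajectory[0])
    t = len(trajectory)
    total = sum(count_in_subset(S, population) for population in trajectory)
    return Fraction(total, b * t)
```

For the population of example 43, `count_in_subset({"u"}, ("u", "v", "u"))` evaluates to 2.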

Lemma 46. Suppose there is an "attractor" probability distribution $p$ on the equivalence class $[x]$ for the stochastic process $\{X_t\}_{t=0}^{\infty}$ in the sense that if $X_0 = x$ with probability 1 then $\lim_{t \to \infty} P(X_t = \cdot) = p$ where $P(X_t = \cdot)$ denotes the probability distribution of the random variable $X_t$ which can be thought of in terms of a vector in $\mathbb{R}^{|[x]|}$ so that the limit as $t \to \infty$ is taken with respect to the $L_1$ norm, let's say (see the footnote below). Then

\[ \lim_{t \to \infty} \Phi(S, x, t) = \frac{1}{b} E_p\left(\mathcal{X}(S, \square)|_{[x]}\right) \]

where $E_p$ denotes the expectation with respect to the probability distribution $p$ on $[x]$, while $\mathcal{X}(S, \square)|_{[x]}$ is the restriction of the function $\mathcal{X}(S, \square)$ introduced in remark 44 to the equivalence class $[x]$.

A sketch of the proof. Consider a "constant" stochastic process $Y_t$ where each random variable $Y_t$ is distributed according to $p$. By assumption $\|P(X_t = \cdot) - P(Y_t = \cdot)\| \to 0$ as $t \to \infty$. On the other hand, by the law of large numbers,

\[
E_p\left(\mathcal{X}(S, \square)|_{[x]}\right) = \lim_{t \to \infty} \frac{\sum_{s=0}^{t-1} \mathcal{X}(S, Y_s)}{t} \overset{\text{after routine } \epsilon\text{-details}}{=} \lim_{t \to \infty} \frac{\sum_{s=0}^{t-1} \mathcal{X}(S, X_s)}{t} = b \cdot \lim_{t \to \infty} \frac{\sum_{s=0}^{t-1} \mathcal{X}(S, X_s)}{b \cdot t} = b \cdot \lim_{t \to \infty} \Phi(S, x, t)
\]

so that the desired assertion follows after dividing both sides of the equation above by $b$. □

Footnote: It is well-known that any two norms on finite dimensional real or complex vector spaces are equivalent so that the choice of the norm is irrelevant here.

In our specific case, thanks to theorem 41, the probability distribution $p$ in lemma 46 is the uniform distribution on the equivalence class $[x]$. Notice that the random variable

\[ \mathcal{X}(S, \square) = \sum_{i=1}^{b} \mathcal{I}_i(S, \square) \tag{13} \]

where $\mathcal{I}_i(S, \square)$ is the indicator function of the $i$-th individual in the argument population with respect to the membership in the subset $S$. More explicitly, if we are given a population $x = (x_1, x_2, \ldots, x_b)^T$ then

\[
\mathcal{I}_i(S, x) =
\begin{cases}
1 & \text{if } x_i \in S \\
0 & \text{otherwise.}
\end{cases}
\tag{14}
\]

Assume now that all transpositions of individuals within the same population are among the transformations in the family $\mathcal{S}$ (see the statement of theorem 41). In other words, $\forall\, i < j$ the transformation $T_{i,j}$ sending a population $x = (x_1, x_2, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_{j-1}, x_j, x_{j+1}, \ldots, x_b)^T$ into the population $T_{i,j}(x) = (x_1, x_2, \ldots, x_{i-1}, x_j, x_{i+1}, \ldots, x_{j-1}, x_i, x_{j+1}, \ldots, x_b)^T$ has positive probability of being chosen. Notice that this is usually a very reasonable assumption since the order of individuals in a population should not matter in practical applications. Then we immediately deduce that any given population $y \in [x]$ if and only if the corresponding population $T_{i,j}(y)$ obtained by swapping the $i$-th and the $j$-th individuals in the population $y$ is a member of $[x]$. When $p$ is the uniform distribution (as in theorem 41), this is equivalent to saying that all the indicator random variables $\mathcal{I}_i(S, \square)$ defined in equation 14 above are identically distributed independently of the index $i$. In particular, they are all distributed as $\mathcal{I}_1(S, \square)$. Using equation 13 together with linearity of expectation, we now deduce that if $\pi$ denotes the uniform distribution on $[x]$ then

\[
E_{\pi}\left(\mathcal{X}(S, \square)|_{[x]}\right) = \sum_{i=1}^{b} E_{\pi}\left(\mathcal{I}_i(S, \square)|_{[x]}\right) = b \cdot E_{\pi}\left(\mathcal{I}_1(S, \square)|_{[x]}\right) = b \cdot \pi(\{y \mid y = (y_1, y_2, \ldots, y_b)^T \in [x] \text{ and } y_1 \in S\}) = b \cdot \frac{|\mathcal{V}(x, S)|}{|[x]|} \tag{15}
\]

where

\[ \mathcal{V}(x, S) = \{y \mid y = (y_1, y_2, \ldots, y_b)^T \in [x] \text{ and } y_1 \in S\} \tag{16} \]

is the subset of $[x]$ consisting solely of populations in $[x]$ the first individuals of which are members of the subset $S \subseteq \Omega$. Combining equation 15 with the conclusion of lemma 46 immediately produces the following very useful fact.

Lemma 47. Under exactly the same setting and assumptions as in theorem 41 together with an additional assumption that all the "swap" transformations defined and discussed in the paragraph following equation 14 are members of the subfamily $\mathcal{S}$ of the family $\mathcal{F}$ of recombination transformations, it is true that $\forall\, S \subseteq \Omega$ we have

\[ \lim_{t \to \infty} \Phi(S, x, t) = \frac{|\mathcal{V}(x, S)|}{|[x]|} \]

where the set $\mathcal{V}(x, S)$ is defined in equation 16.
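
The identity behind lemma 47 (equation 15 divided by b) can be verified exhaustively on a tiny orbit. The sketch below is illustrative only: the two transformations, the two-letter individuals and the helper names are assumptions, not part of the paper. It compares the proportion of populations in the orbit whose first individual lies in S with the expected fraction of individuals from S under the uniform distribution on the orbit.

```python
from fractions import Fraction


def swap(pop):
    """Transposition of the two individuals of a population of size 2."""
    return (pop[1], pop[0])


def crossover(pop):
    """One-point homologous crossover on two-letter individuals."""
    x, y = pop
    return (x[0] + y[1], y[0] + x[1])


TRANSFORMS = (swap, crossover)  # hypothetical generating family containing the swap


def orbit(x):
    """All populations reachable from x by composing the transformations."""
    seen, frontier = {x}, [x]
    while frontier:
        current = frontier.pop()
        for transform in TRANSFORMS:
            nxt = transform(current)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def first_individual_ratio(x, S):
    """|V(x, S)| / |[x]|: share of populations whose first individual is in S."""
    cls = orbit(x)
    hits = sum(1 for y in cls if y[0] in S)
    return Fraction(hits, len(cls))


def expected_fraction(x, S):
    """(1/b) times the uniform expectation of the number of individuals in S."""
    cls = orbit(x)
    total = sum(sum(1 for individual in y if individual in S) for y in cls)
    return Fraction(total, len(x) * len(cls))
```

Both quantities agree for every subset S in this example, which is exactly what the symmetry argument with the swap transformations predicts.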

Lemma 47 allows us to derive Geiringer-like theorems in a rather straightforward fashion for several classes of evolutionary algorithms via the following simple strategy: suppose we are given a subset $S \subseteq \Omega$. According to lemma 47 all we have to do to compute the desired limiting frequency of occurrence of a certain subset $S \subseteq \Omega$ is to calculate the ratio $\frac{|\mathcal{V}(x, S)|}{|[x]|}$. For some subsets of the search space such a ratio is quite obvious, yet for others it may be combinatorially non-achievable. In evolutionary computation, it is often possible to define an appropriate notion of schemata (this is precisely what we have done in section 4.1 for the case of MCT) which has, intuitively speaking, a "product-like flavor" that allows us to exploit the following observation: suppose we can find a sequence of subsets $S_1 \supseteq S_2 \supseteq \cdots \supseteq S_{n-1} \supseteq S_n = S$. We can then write

\[
\lim_{t \to \infty} \Phi(S, x, t) = \frac{|\mathcal{V}(x, S)|}{|[x]|} = \frac{|\mathcal{V}(x, S_n)|}{|\mathcal{V}(x, S_{n-1})|} \cdot \frac{|\mathcal{V}(x, S_{n-1})|}{|\mathcal{V}(x, S_{n-2})|} \cdots \frac{|\mathcal{V}(x, S_1)|}{|[x]|} \overset{\text{by lemmas 47 and 46}}{=} \frac{1}{b} E_{\pi}\left(\mathcal{X}(S_1, \square)|_{[x]}\right) \cdot \prod_{k=1}^{n-1} \frac{|\mathcal{V}(x, S_{k+1})|}{|\mathcal{V}(x, S_k)|} \tag{17}
\]

The idea is that the individual ratios in the right hand side of equation 17 may be quite simple to compute, as happens to be the case when deriving finite population Geiringer-like theorems for GP with homologous crossover (see [13] and [12]). When deriving the finite population version of the Geiringer-like theorem with non-homologous recombination in the limit of large population size, rather than computing the ratios in equation 17 we will instead estimate each one of them from above and from below exploiting the main Geiringer theorem (theorem 41) together with the methodology for estimating the stationary distributions of Markov chains based on the lumping quotient construction appearing in [14], [16] and [15]. All of the necessary apparatus and one enhanced lemma will be summarized and presented in the next subsection for the sake of completeness.



5.3. Lumping Quotients of Markov Chains and Markov Inequality 

Throughout the current subsection we shall be dealing with a Markov chain $\mathcal{M}$ (not necessarily irreducible) over a finite state space $\mathcal{X}$. $\{p_{x \to y}\}_{x, y \in \mathcal{X}}$ denotes the Markov transition matrix with the convention that $p_{x \to y}$ is the probability of getting $y$ in the next stage given $x$. Let $\pi$ denote a stationary distribution of the Markov chain $\mathcal{M}$ (here we will assume that at least one stationary distribution does exist). Furthermore we will assume that the stationary distribution $\pi$ has the property that $\forall\, x \in \mathcal{X}$ $\pi(x) \neq 0$. Suppose we are given an equivalence relation $\sim$ partitioning the state space $\mathcal{X}$. The aim of the current section is to construct a Markov chain over the equivalence classes under $\sim$ (i.e. over the set $\mathcal{X}/\sim$) whose stationary distribution is compatible with the distribution $\pi$ and then to exploit the constructed lumped quotient chain to estimate certain ratios of the stationary distribution values. In fact, this methodology has been successfully used to establish some properties of the stationary distributions of the irreducible Markov chains modeling a wide class of evolutionary algorithms (see [14], [16] and [15]).

Definition 48. Given a Markov chain $\mathcal{M}$ over a finite state space $\mathcal{X}$ determined by the transition matrix $\{p_{x \to y}\}$, an equivalence relation $\sim$ on $\mathcal{X}$, and a stationary distribution $\pi$ of the Markov chain $\mathcal{M}$ satisfying the property that $\forall\, x \in \mathcal{X}$ $\pi(x) \neq 0$, define the quotient Markov chain $\mathcal{M}/\sim$ over the state space $\mathcal{X}/\sim$ of equivalence classes via $\sim$ to be determined by the transition matrix $\{p_{U \to V}\}_{U, V \in \mathcal{X}/\sim}$ given as

\[
p_{U \to V} = \frac{1}{\pi(U)} \sum_{x \in U} \pi(x) \cdot p_{x \to V} = \frac{1}{\pi(U)} \sum_{x \in U} \sum_{y \in V} \pi(x) \cdot p_{x \to y}.
\]

Here $p_{x \to V}$ denotes the transition probability of getting somewhere inside of $V$ given $x$. Since $V = \bigcup_{y \in V} \{y\}$ it follows that $p_{x \to V} = \sum_{y \in V} p_{x \to y}$ and hence the equation above holds.

Intuitively, the quotient Markov chain $\mathcal{M}/\sim$ is obtained by running the original chain $\mathcal{M}$ starting with the stationary distribution $\pi$ and computing the transition probabilities of the associated stochastic process conditioned with respect to the stationary input. Thereby, the following fact should not be a surprise:

Theorem 49. Let $\pi$ denote a stationary distribution of a Markov chain $\mathcal{M}$ determined by the transition matrix $\{p_{x \to y}\}_{x, y \in \mathcal{X}}$ and having the property that $\forall\, x \in \mathcal{X}$ $\pi(x) \neq 0$. Suppose we are given an equivalence relation $\sim$ partitioning the state space $\mathcal{X}$. Then the probability distribution $\tilde{\pi}$ defined as $\tilde{\pi}(\{O\}) = \pi(O)$ is a stationary distribution of the quotient Markov chain $\mathcal{M}/\sim$ assigning nonzero probability to every state (i.e. to every equivalence class under $\sim$).

Proof: This fact can be verified by direct computation. Indeed, we obtain

\[
\sum_{O \in \mathcal{X}/\sim} \tilde{\pi}(\{O\}) \cdot p_{O \to U} = \sum_{O \in \mathcal{X}/\sim} \pi(O) \cdot \frac{1}{\pi(O)} \sum_{x \in O} \sum_{z \in U} \pi(x) \cdot p_{x \to z} = \sum_{x \in \mathcal{X}} \sum_{z \in U} \pi(x) \cdot p_{x \to z} = \sum_{z \in U} \sum_{x \in \mathcal{X}} \pi(x) \cdot p_{x \to z} \overset{\text{by stationarity of } \pi}{=} \sum_{z \in U} \pi(z) = \pi(U) = \tilde{\pi}(\{U\}).
\]

This establishes the stationarity of $\tilde{\pi}$ and theorem 49 now follows. □

Although theorem 49 is rather elementary it allows us to deduce interesting and insightful results (see [14], [16] and [15]) via the observations presented below. To state these results it is convenient to generalize the notion of transition probabilities in the following manner (which is coherent with definition 48):



Definition 50. Given a Markov chain $\mathcal{M}$ with state space $\mathcal{X}$ and a stationary distribution $\pi$, for any two subsets $A$ and $B \subseteq \mathcal{X}$, we define

\[ p_{A \to B} = \sum_{a \in A} \frac{\pi(a)}{\pi(A)} \, p_{a \to B} \quad \text{where} \quad p_{a \to B} = \sum_{b \in B} p_{a \to b}. \]

Remark 51. It is worth emphasizing that in case when $B = A$ or $A \cap B = \emptyset$ the transition probabilities $p_{A \to B}$ are precisely the transition probabilities of various quotient Markov chains which have $A$ and $B$ as their states according to definition 48. In particular, if we consider the quotient Markov chain comprised of the states $A$ and $A^c$, where $A^c$ denotes the complement of $A$, we have $1 - p_{A \to A} = p_{A \to A^c}$.
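
The quotient construction of definition 48 and the claim of theorem 49 can be checked numerically. The following is a small sketch, not the authors' code: the three-state chain `P`, its stationary distribution `PI` and the function names are assumptions chosen so that every quantity can be verified by hand with exact fractions.

```python
from fractions import Fraction as Fr

# Hypothetical 3-state chain; row x holds the probabilities p_{x -> y}.
P = [
    [Fr(1, 2), Fr(1, 2), Fr(0)],
    [Fr(1, 4), Fr(1, 2), Fr(1, 4)],
    [Fr(0), Fr(1, 2), Fr(1, 2)],
]
# A stationary distribution of P with no zero entries.
PI = [Fr(1, 4), Fr(1, 2), Fr(1, 4)]


def is_stationary(matrix, dist):
    """Check dist * matrix == dist entrywise."""
    n = len(matrix)
    return all(sum(dist[x] * matrix[x][y] for x in range(n)) == dist[y] for y in range(n))


def quotient_chain(matrix, dist, classes):
    """Lumped transition matrix p_{U -> V} and lumped distribution pi(U).

    classes is a list of disjoint lists of states covering the state space.
    """
    lumped = [sum(dist[x] for x in U) for U in classes]
    Q = [
        [sum(dist[x] * matrix[x][y] for x in U for y in V) / lumped[i] for V in classes]
        for i, U in enumerate(classes)
    ]
    return Q, lumped
```

Lumping the states into A = {0} and its complement {1, 2} gives a 2 by 2 chain whose rows sum to 1, so that 1 − p_{A→A} = p_{A→A^c}, and the lumped distribution (1/4, 3/4) is stationary for it.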

In the current paper we will use a lumping quotient chain consisting of only 2 equivalence classes, $A$ and $B = A^c$ (i.e. the complement of $A$ in the state space $\mathcal{X}$). For a 2 by 2 Markov transition matrix we easily see that if $\pi$ denotes the unique stationary distribution of the original Markov chain $\mathcal{M}$ then, thanks to theorem 49, we have $\pi(A) p_{A \to A} + \pi(B) p_{B \to A} = \pi(A)$ so that $\pi(B) p_{B \to A} = \pi(A)(1 - p_{A \to A}) = \pi(A) p_{A \to B}$ and, if neither $A$ nor $B$ is empty, we have

\[ \frac{\pi(A)}{\pi(B)} = \frac{p_{B \to A}}{p_{A \to B}} \tag{18} \]

Equation 18 tells us that in order to estimate the ratio of the stationary distribution values of the Markov chain on a pair of complementary subsets of the state space $A$ and $B = A^c$, it is sufficient to estimate the ratio of the generalized transition probabilities $p_{B \to A}$ and $p_{A \to B}$. Although these transition probabilities do depend on the stationary distribution itself, it is sometimes possible to estimate them using a convexity-based bound appearing in [14], [16] and [15]. For the purpose of the present work we need to introduce a mild generalization of this bound appearing below:

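Lemma 52. Suppose, as in definition 50, $A$ and $B \subseteq \mathcal{X}$ and $U \subseteq \mathcal{X}$ such that

\[ \frac{\pi(U \cap A)}{\pi(A)} \le \epsilon < 1. \]

Suppose further that for some constant $\kappa$ with $0 \le \kappa \le 1$ the following is true: $\forall\, a \in A \cap U^c$ we have $p_{a \to B} \le \kappa$. Then we have $p_{A \to B} \le (1 - \epsilon)\kappa + \epsilon$. Dually, assume that for a constant $\lambda$ with $0 \le \lambda \le 1$ it is true that $\forall\, a \in A \cap U^c$ we have $p_{a \to B} \ge \lambda$. Then $p_{A \to B} \ge (1 - \epsilon)\lambda$.

Proof. Indeed, we have

\[
p_{A \to B} = \sum_{a \in A} \frac{\pi(a)}{\pi(A)} \, p_{a \to B} = \sum_{a \in A \cap U^c} \frac{\pi(a)}{\pi(A)} \, p_{a \to B} + \sum_{a \in A \cap U} \frac{\pi(a)}{\pi(A)} \, p_{a \to B}. \tag{19}
\]

Notice that

\[
\sum_{a \in A \cap U^c} \frac{\pi(a)}{\pi(A)} = \frac{\pi(A \cap U^c)}{\pi(A)} = 1 - \frac{\pi(U \cap A)}{\pi(A)} \quad \text{while} \quad 0 \le \sum_{a \in A \cap U} \frac{\pi(a)}{\pi(A)} = \frac{\pi(U \cap A)}{\pi(A)} \le \epsilon. \tag{20}
\]

The desired inequalities now follow when we plug the bounds in the assumptions into equation 19 and then use the inequalities in equation 20 together with the fact that probabilities are always between 0 and 1. □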

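In a special case when $U = \emptyset$ lemma 52 entails the following.

Corollary 53. Given any two subsets $A$ and $B \subseteq \mathcal{X}$, if for some constant $\kappa$ with $0 \le \kappa \le 1$ it is true that $\forall\, a \in A$ we have $p_{a \to B} \le \kappa$ then $p_{A \to B} \le \kappa$. Dually, if for some constant $\lambda$ with $0 \le \lambda \le 1$ it is true that $\forall\, a \in A$ we have $p_{a \to B} \ge \lambda$ then $p_{A \to B} \ge \lambda$. Consequently, if for some constant $\gamma$ it happens that $\forall\, a \in A$ we have $p_{a \to B} = \gamma$ then $p_{A \to B} = \gamma$.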

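Combining equation 18 with lemma 52 readily gives us the following.

Lemma 54. Suppose $A$ and $B \subseteq \mathcal{X}$ is a complementary pair of subsets (i.e. $A \cap B = \emptyset$ and $A \cup B = \mathcal{X}$). Suppose further that $U \subseteq \mathcal{X}$ is such that

\[ \frac{\pi(U \cap A)}{\pi(A)} \le \epsilon < 1 \quad \text{and} \quad \frac{\pi(U \cap B)}{\pi(B)} \le \delta < 1. \]

Assume now that we find constants $\lambda_1$, $\lambda_2$, $\kappa_1$ and $\kappa_2$ such that $\forall\, b \in U^c \cap B$ we have $\lambda_1 \le p_{b \to A} \le \kappa_1$ and $\forall\, a \in U^c \cap A$ we have $\lambda_2 \le p_{a \to B} \le \kappa_2$. Then we have

\[ \frac{(1 - \delta)\lambda_1}{(1 - \epsilon)\kappa_2 + \epsilon} \le \frac{\pi(A)}{\pi(B)} \le \frac{(1 - \delta)\kappa_1 + \delta}{(1 - \epsilon)\lambda_2} \]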
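
A small numerical check of the ratio identity for a complementary pair and of the two-sided bound of lemma 54 is sketched below. The three-state chain, its stationary distribution, the choice of the sets A, B and U, and all function names are illustrative assumptions; the constants are taken to be the tightest ones available, namely the minimum and maximum of the one-step probabilities over the states outside U.

```python
from fractions import Fraction as Fr

# Hypothetical chain and one of its stationary distributions (all entries nonzero).
P = [
    [Fr(1, 2), Fr(1, 2), Fr(0)],
    [Fr(1, 4), Fr(1, 2), Fr(1, 4)],
    [Fr(0), Fr(1, 2), Fr(1, 2)],
]
PI = [Fr(1, 4), Fr(1, 2), Fr(1, 4)]


def mass(states):
    """pi of a set of states (0 for the empty set)."""
    return sum(PI[s] for s in states)


def p_state_to_set(a, B):
    """p_{a -> B}: probability of landing in B in one step from state a."""
    return sum(P[a][b] for b in B)


def p_set_to_set(A, B):
    """Generalized transition probability p_{A -> B} of definition 50."""
    return sum(PI[a] * p_state_to_set(a, B) for a in A) / mass(A)


def lemma54_bounds(A, B, U):
    """Lower and upper bounds on pi(A)/pi(B); needs A - U, B - U non-empty and lambda_2 > 0."""
    eps = mass(A & U) / mass(A)
    delta = mass(B & U) / mass(B)
    to_A = [p_state_to_set(b, A) for b in B - U]
    to_B = [p_state_to_set(a, B) for a in A - U]
    lam1, kap1 = min(to_A), max(to_A)
    lam2, kap2 = min(to_B), max(to_B)
    lower = (1 - delta) * lam1 / ((1 - eps) * kap2 + eps)
    upper = ((1 - delta) * kap1 + delta) / ((1 - eps) * lam2)
    return lower, upper
```

With A = {0, 1}, B = {2} and U = {0}, the exact ratio π(A)/π(B) equals 3, and the bounds evaluate to 1 and 3, so the upper bound happens to be tight in this example.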

In order to apply lemma 54 effectively we need to know that both $\frac{\pi(U \cap A)}{\pi(A)}$ and $\frac{\pi(U \cap B)}{\pi(B)}$ are small. As we shall see in the next subsection, the inductive hypothesis will imply that at least one of these ratios is small. The following simple lemma will allow us to deduce that the remaining ratio is also small as long as a certain ratio of generalized transition probabilities is bounded below.

Lemma 55. Suppose $A$ and $B \subseteq \mathcal{X}$ with $A \cap B = \emptyset$ (notice that we do not require $A \cup B = \mathcal{X}$). Then

\[ \pi(A) \ge \pi(B) \, \frac{p_{B \to A}}{p_{A \to A^c}}. \]

Proof. Let $C = \mathcal{X} \cap (A \cup B)^c$. Consider the lumped Markov chain on the state space $\{A, B, C\}$. Since $\pi$ is the stationary distribution of the Markov chain $\mathcal{M}$, by theorem 49 (see also definition 50 and remark 51) we have

\[ \pi(A) = \pi(B) p_{B \to A} + \pi(A) p_{A \to A} + \pi(C) p_{C \to A} \]

so that

\[ (1 - p_{A \to A}) \pi(A) = \pi(B) p_{B \to A} + \pi(C) p_{C \to A} \ge \pi(B) p_{B \to A} \]

since probabilities are nonnegative. The desired conclusion now follows when dividing both sides of the inequality above by $1 - p_{A \to A} = p_{A \to A^c}$. □

Finally, there is another very simple and general classical inequality that will be elegantly exploited in the next section to set the stage for the application of lemma 54, allowing us to avoid unpleasant combinatorial complications.

Lemma 56 (Markov Inequality) Suppose $H$ is a non-negative valued random variable on a probability space $\Omega$ with probability measure $\Pr$. Then $\forall\, \lambda > 0$ we have

\[ 0 \le \Pr(H > \lambda \cdot E(H)) \le \frac{1}{\lambda} \to 0 \text{ as } \lambda \to \infty. \]

Proof. By definition of expectation we have

\[
E(H) = \int_{\Omega} H \, d\Pr \overset{\text{by positivity of } H}{\ge} \int_{H > \lambda E(H)} H \, d\Pr \ge \Pr(H > \lambda \cdot E(H)) \cdot (\lambda \cdot E(H)).
\]

Now, if $\Pr(H > 0) = 0$ then $H = 0$ almost surely so that $E(H) = 0$ and

\[ \Pr(H > \lambda \cdot E(H)) = \Pr(H > 0) = 0 < \frac{1}{\lambda}. \]

Otherwise, $\Pr(H > 0) > 0 \implies E(H) = \int_{\Omega} H \, d\Pr > 0$ and the desired inequality follows when dividing both sides of the inequality above by $\lambda \cdot E(H)$. □
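
The Markov inequality is easy to confirm on a finite distribution. The sketch below uses a hypothetical four-point distribution `DIST` and hypothetical helper names; it computes the tail probability exactly so it can be compared with the bound 1/λ.

```python
from fractions import Fraction as Fr

# Hypothetical distribution of a non-negative random variable H: value -> probability.
DIST = {0: Fr(1, 2), 1: Fr(1, 4), 2: Fr(1, 8), 10: Fr(1, 8)}


def expectation(dist):
    return sum(value * prob for value, prob in dist.items())


def tail_probability(dist, lam):
    """Pr(H > lam * E(H)) for a finite distribution."""
    threshold = lam * expectation(dist)
    return sum(prob for value, prob in dist.items() if value > threshold)
```

Here E(H) = 7/4, and for instance Pr(H > 2 · E(H)) = 1/8, which is indeed at most 1/2.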

We end this section with a very well-known elementary fact about Markov chains having symmetric transition matrices that will also be used in the proof of theorem 40.

Proposition 57. Let M be any Markov chain determined by a symmetric transition matrix. 
Then the uniform distribution is a stationary distribution of the Markov chain M ( notice 
that M is not assumed to be irreducible). 



Proof. The reader may easily see that the Markov transition matrix is doubly-stochastic or verify that the uniform distribution is stationary directly from the detailed balance equations. □



5.4. Deriving the Geiringer-like Theorem (Theorem 40) for the MCT algorithm

We now recall the setting of section 4. At first we will prove the theorem for a mildly extended family of recombination transformations $\tilde{\mathcal{F}}$ which, in addition to the transformations in definition 12, also contains all the transpositions (or swaps) of the rollouts in a population, and these are selected with positive probability (a detailed description appears in the paragraph following equation 14). Since every transposition of rollouts is a bijection on the set of all populations, theorem 41 still applies, except that the equivalence classes will be enlarged by a factor of $(b \cdot m)!$, i.e. $|[P_m]_{\tilde{\mathcal{F}}}| = (b \cdot m)! \cdot |[P_m]_{\mathcal{F}}|$ (this is so because every permutation is a composition of transpositions). Thanks to the assumption we will be in a position to apply the tools based on lemma 47, namely equation 17. This assumption will be dropped at the end via apparent symmetry considerations. Indeed, any permutation $\pi$ of the rollouts in a population $Q \in [P_m]$ naturally commutes with all the recombination transformations in definition 16, thereby providing a family of bijections between the equivalence class $[P_m]_{\mathcal{F}}$ and each of the $(b \cdot m)!$ disjoint pieces comprising the partition of the equivalence class $[P_m]_{\tilde{\mathcal{F}}}$. Furthermore, permutations preserve the multisets of rollouts within a population so that the frequencies of occurrence of various subsets in the corresponding pieces will be preserved and, thereby, the conclusion of theorem 40 with the family of recombination transformations $\mathcal{F}$ replaced by $\tilde{\mathcal{F}}$ will be exactly the same.

Footnote (to the proof of proposition 57): This is also a particular case of the well-known reversibility property of Markov chains.
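Recall the schema

\[ h = (\alpha, i_1, i_2, \ldots, i_{k-1}, x_k) \]

of height $k - 1 > 0$ in the statement of theorem 40. Notice that thanks to proposition 28 we can write the given schema $h$ as

\[ h = h_k \subseteq h_{k-1} \subseteq h_{k-2} \subseteq \cdots \subseteq h_2 \subseteq h_1 \tag{21} \]

where $h_1 = (\alpha, i_1, \#)$ and, in general, when $1 < j < k$

\[ h_j = (\alpha, i_1, i_2, \ldots, i_j, \#) \]

are Holland schemata. Thanks to equation 17, $\forall\, m \in \mathbb{N}$ we have

\[
\lim_{t \to \infty} \Phi(P_m, h, t) = \frac{1}{b \cdot m} E_{\rho}\left(\mathcal{X}(h_1, \square)|_{[P_m]}\right) \cdot \prod_{q=1}^{k-1} \frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|} \tag{22}
\]

and, taking the limit as $m \to \infty$,

\[
\lim_{m \to \infty} \lim_{t \to \infty} \Phi(P_m, h, t) = \lim_{m \to \infty} \frac{1}{b \cdot m} E_{\rho}\left(\mathcal{X}(h_1, \square)|_{[P_m]}\right) \cdot \prod_{q=1}^{k-1} \lim_{m \to \infty} \frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|} \tag{23}
\]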

where $\rho$ is the uniform distribution on $[P_m]$. First of all, notice that $\forall\, m \in \mathbb{N}$ the random variable $\mathcal{X}(h_1, \square)|_{[P_m]}$ is a constant function which is equal to $\mathrm{Order}(\alpha \downarrow i_1, P_m) = m \cdot \mathrm{Order}(\alpha \downarrow i_1, P)$ (see remark 34 and proposition 38). It follows trivially then that $\frac{1}{b \cdot m} E_{\rho}\left(\mathcal{X}(h_1, \square)|_{[P_m]}\right) = \frac{\mathrm{Order}(\alpha \downarrow i_1, P)}{b}$, giving us the first ratio factor in the right hand side of equation 40. In particular, when $h = h_1$ is a schema of height 1 ending with a $\#$ there is no need to take the limit as $m \to \infty$ regardless of whether or not the population $P$ is homologous. To deal with the remaining ratios in the general case, when the population $P$ is not necessarily homologous, we will exploit the classical and elementary Markov inequality (lemma 56) in a rather elegant manner to set up the stage for the application of lemmas 52 and 54 in the following manner.

Consider the random variable $H_i : [P_m] \to \mathbb{N}$, where $[P_m]$ is equipped with the uniform probability measure $\rho$, measuring the height of the $i$-th rollout in the population $Q \in [P_m]$. In other words,

\[ H_i(Q) = \text{the height of the } i\text{-th rollout in the population } Q. \]

Notice that $\forall\, i$ and $j$ with $1 \le i < j \le b \cdot m$ the random variables $H_i$ and $H_j$ are identically distributed (indeed, thanks to theorem 41 the swap of the rollouts $i$ and $j$ in the population $P_m$ is an isomorphism of the probability space $[P_m]$ with itself, call it $\tau$, such that $H_i \circ \tau = H_j$ and vice versa). In particular, these random variables have the same expectation. Thanks to remark 36 and proposition 38, we deduce that

\[
E(H_1) = \frac{\sum_{i=1}^{b \cdot m} E(H_i)}{b \cdot m} = \frac{E\left(\sum_{i=1}^{b \cdot m} H_i\right)}{b \cdot m} = \frac{\mathrm{Total}(P_m)}{b \cdot m} = \frac{m \cdot \mathrm{Total}(P)}{b \cdot m} = \frac{\mathrm{Total}(P)}{b} \tag{24}
\]

Notice that the right hand side of equation 24 does not depend on $m$. In other words, $\forall\, m \in \mathbb{N}$ the expected height of the first rollout in the population $P_m$ is the same and is equal to $\frac{\mathrm{Total}(P)}{b}$. At the same time, according to proposition 38, the functions

\[ \mathrm{Order}(\alpha \downarrow j, P_m) \to \infty \text{ and } \mathrm{Order}(i \downarrow j, P_m) \to \infty \text{ as } m \to \infty. \tag{25} \]

The above observation opens the door for the application of the Markov inequality that will, in turn, allow us to exploit lemma 54 with the aim of estimating the desired ratios involved in equation 17 and then showing that the upper and the lower bounds on these fractions converge to the corresponding ratios involved in the right hand side of equation 40 in the conclusion of the statement of theorem 40. We now proceed in detail. Let $\delta > 0$ be an arbitrary small number (informally speaking, $\delta \ll 1$). Choose $M \in \mathbb{N}$ large enough so that

\[ \delta^2 \cdot M > \frac{\mathrm{Total}(P)}{b} = E(H_1) \]

(see equation 24). For $m > M$ let

\[ U_m^{\delta} = \{Q \mid Q \in [P_m] \text{ and } H_1(Q) > \delta \cdot m\} \tag{26} \]

and observe that the Markov inequality (lemma 56) tells us that

\[
\rho(U_m^{\delta}) \overset{\text{by definition of } U_m^{\delta} \text{ in equation 26}}{=} \rho(\{Q \mid H_1(Q) > \delta \cdot m\}) = \rho\left(H_1 > \frac{1}{\delta} \cdot (\delta^2 \cdot m)\right) \overset{\text{since } m > M}{\le} \rho\left(H_1 > \frac{1}{\delta} \cdot E(H_1)\right) \overset{\text{by Markov inequality}}{\le} \frac{1}{1/\delta} = \delta \tag{27}
\]

where $\rho$ denotes the uniform probability distribution on the set $[P_m]$.

As the reader probably anticipates by now, our aim is to show that each of the ratios of the form

\[
\lim_{m \to \infty} \frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|} = \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_{\Sigma} (P)}
\]

so that equation 40 in the conclusion of theorem 40 would follow from equation 22 when taking the limit of both sides as $m \to \infty$. First of all, let us take care of the "trivial extremes" when for some $q$ with $1 \le q \le k - 1$ we have either ($\mathrm{Order}(i_{q-1} \downarrow i_q, P) = 0$) or ($\forall\, j \neq i_q$ we have $\mathrm{Order}(i_{q-1} \downarrow j, P) = 0$ and $i_{q-1} \downarrow_{\Sigma} (P) = 0$) or (($x_k \in \Sigma$) and (either $i_{k-1} \downarrow_{\Sigma} (P) = 0$ or $x_k \notin i_{k-1} \downarrow_{\Sigma} (P)$)) or ($\forall\, j \in \mathbb{N}$ we have $\mathrm{Order}(i_{k-1} \downarrow j, P) = 0$ and $x_k$ is the only terminal label member of the set $i_{k-1} \downarrow (P)$, i.e. $i_{k-1} \downarrow (P) \cap \Sigma = \{x_k\}$) or ($x_k = \#$). According to proposition 38 the statement above holds for a population $P$ if and only if $\forall\, m \in \mathbb{N}$ it holds when the population $P$ is replaced with $P_m$. In the case when either $\mathrm{Order}(i_{q-1} \downarrow i_q, P_m) = 0$ or $i_{k-1} \downarrow_{\Sigma} (P_m) = 0$ or $x_k \notin i_{k-1} \downarrow_{\Sigma} (P)$, no individual fitting the schema $h$ is present in any population $Q \in [P_m]$ so that $\forall\, m$ and $t \in \mathbb{N}$ we have $\Phi(P_m, h, t) = 0$. Thereby the left hand side of equation 40 is trivially 0. The right hand side is 0 as well in this case since the numerator of one of the fractions in the product is 0 (see the convention remark in the statement of theorem 40). This finishes the verification of one trivial extreme case. Suppose now for some index $q$ it is the case that $\forall\, j \neq i_q$ we have $\mathrm{Order}(i_{q-1} \downarrow j, P) = 0$ and $i_{q-1} \downarrow_{\Sigma} (P) = 0$. In this case we observe that any individual occurring in a population $Q \in [P_m]$ which fits the schema $h_{q-1}$ also fits the schema $h_q$. In particular, the sets $\mathcal{V}(P_m, h_{q+1})$ and $\mathcal{V}(P_m, h_q)$ are equal and we trivially have $\forall\, m \in \mathbb{N}$ $\frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|} = 1$. Of course, the corresponding ratio

\[
\frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_{\Sigma} (P)} = 1
\]

as well since $\mathrm{Order}(i_q \downarrow i_{q+1}, P)$ is the only nonzero contributing summand in the denominator. The last factor ratio is supposed to coincide with the ratio $\frac{|\mathcal{V}(P_m, h_k)|}{|\mathcal{V}(P_m, h_{k-1})|}$. This ratio is either 0 or 1 in the extreme cases and verifying the validity of equation 40 is entirely analogous to the above. We now move on to the interesting case when none of the trivial extremes above happen. For schemata $x$ and $y$ we write $x \setminus y = S_x \cap (S_y)^c$ (see definition 25) to denote the set of rollouts fitting the schema $x$ and not fitting the schema $y$. Rather than estimating or, in case of homologous population $P$, evaluating exactly the ratios of the form $\frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|}$, we will estimate and, in case of homologous recombination, evaluate the ratios of the form $\frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|}$ since these are more convenient to tackle using the tools in section 5.3. The following very simple fact demonstrates the connection between the two:

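Lemma 58. Suppose that $\forall\, m$, whenever $1 \le q < k - 1$,

$$\frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|} = \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P) \text{ and } j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}$$

and neither the numerator nor the denominator of any of the fractions is 0. Then

$$\frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|} = \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}.$$

Likewise, if

$$\lim_{m \to \infty} \frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|} = \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P) \text{ and } j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}$$

and for all sufficiently large $m$ neither the numerator nor the denominator of any of the fractions involved vanishes, then

$$\lim_{m \to \infty} \frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|} = \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}.$$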

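Proof. Clearly $\mathcal{V}(P_m, h_q) = \mathcal{V}(P_m, h_{q+1}) \uplus \mathcal{V}(P_m, h_q \setminus h_{q+1})$ where $\uplus$ emphasizes that this is a union of disjoint sets. The rest is just a matter of careful verification: we have

$$\frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|} = \frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_{q+1})| + |\mathcal{V}(P_m, h_q \setminus h_{q+1})|} = \frac{1}{1 + \frac{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|}{|\mathcal{V}(P_m, h_{q+1})|}}. \qquad (28)$$

Taking the limit as $m \to \infty$ on both sides of equation 28 yields

$$\lim_{m \to \infty} \frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q)|} = \frac{1}{1 + \lim_{m \to \infty} \frac{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|}{|\mathcal{V}(P_m, h_{q+1})|}}. \qquad (29)$$

The right hand sides of equations 28 and 29 are easily computed directly from the corresponding formulas in the assumptions and each of them is:

$$\frac{1}{1 + \frac{\sum_{j \in i_q \downarrow (P) \text{ and } j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}} = \frac{1}{\frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\mathrm{Order}(i_q \downarrow i_{q+1}, P)} + \frac{\sum_{j \in i_q \downarrow (P) \text{ and } j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}} = \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}$$

yielding the asserted conclusions. □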
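The algebra in the proof of lemma 58 can be checked on a toy instance. The following is a minimal sketch using exact rational arithmetic; the three counts `order_next`, `other_orders` and `terminal_labels` are made-up illustrative values (assumptions, not data from the paper) standing for $\mathrm{Order}(i_q \downarrow i_{q+1}, P)$, the sum of the remaining $\mathrm{Order}(i_q \downarrow j, P)$ and $i_q \downarrow_\Sigma (P)$ respectively.

```python
from fractions import Fraction


def ratio_to_whole(ratio_to_complement: Fraction) -> Fraction:
    """Given r = |A| / |B| for disjoint finite sets A and B (r != 0),
    return |A| / |A u B| = 1 / (1 + 1/r), as in equation 28."""
    return 1 / (1 + 1 / ratio_to_complement)


# Hypothetical counts, chosen only for illustration.
order_next = 3        # plays the role of Order(i_q -> i_{q+1}, P)
other_orders = 4      # sum of Order(i_q -> j, P) over j != i_{q+1}
terminal_labels = 2   # plays the role of the terminal-label count

# Assumption of the lemma: ratio of V(h_{q+1}) to V(h_q \ h_{q+1}).
assumed = Fraction(order_next, other_orders + terminal_labels)

# Conclusion of the lemma: ratio of V(h_{q+1}) to V(h_q).
concluded = ratio_to_whole(assumed)
expected = Fraction(order_next, order_next + other_orders + terminal_labels)
```

With these counts both `concluded` and `expected` equal 1/3, matching the closed form stated in the lemma.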

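Entirely analogously,

Lemma 59. Suppose that $\forall\, m$

$$\frac{|\mathcal{V}(P_m, h_k)|}{|\mathcal{V}(P_m, h_{k-1} \setminus h_k)|} = \frac{1}{\sum_{j \in i_{k-1} \downarrow (P)} \mathrm{Order}(i_{k-1} \downarrow j, P) + i_{k-1} \downarrow_\Sigma (P) - 1}$$

and the denominators do not vanish. Then

$$\frac{|\mathcal{V}(P_m, h_k)|}{|\mathcal{V}(P_m, h_{k-1})|} = \frac{1}{\sum_{j \in i_{k-1} \downarrow (P)} \mathrm{Order}(i_{k-1} \downarrow j, P) + i_{k-1} \downarrow_\Sigma (P)}.$$

Likewise, if

$$\lim_{m \to \infty} \frac{|\mathcal{V}(P_m, h_k)|}{|\mathcal{V}(P_m, h_{k-1} \setminus h_k)|} = \frac{1}{\sum_{j \in i_{k-1} \downarrow (P)} \mathrm{Order}(i_{k-1} \downarrow j, P) + i_{k-1} \downarrow_\Sigma (P) - 1}$$

and for all sufficiently large $m$ none of the denominators of the fractions involved vanishes, then

$$\lim_{m \to \infty} \frac{|\mathcal{V}(P_m, h_k)|}{|\mathcal{V}(P_m, h_{k-1})|} = \frac{1}{\sum_{j \in i_{k-1} \downarrow (P)} \mathrm{Order}(i_{k-1} \downarrow j, P) + i_{k-1} \downarrow_\Sigma (P)}.$$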



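To estimate or, in the special case of homologous population $P$, to compute exactly, the ratios $\frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|}$ the following strategy will be employed. For a given $m \in \mathbb{N}$ consider the set of all populations $\mathcal{V}(P_m, h_q)$ (i.e. the set of those populations in $[P_m]$ the first individual of which fits the schema $h_q$). Let now $\pi_{q,m}$ denote the uniform probability measure on the set $\mathcal{V}(P_m, h_q)$. We then have

$$\frac{|\mathcal{V}(P_m, h_{q+1})|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|} = \frac{|\mathcal{V}(P_m, h_{q+1})| \,/\, |\mathcal{V}(P_m, h_q)|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})| \,/\, |\mathcal{V}(P_m, h_q)|} = \frac{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1}))}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))} \qquad (30)$$

and, more generally, $\forall$ set of rollouts $S$,

$$\frac{|\mathcal{V}(P_m, h_{q+1} \cap S)|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|} = \frac{|\mathcal{V}(P_m, h_{q+1} \cap S)| \,/\, |\mathcal{V}(P_m, h_q)|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})| \,/\, |\mathcal{V}(P_m, h_q)|} = \frac{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1} \cap S))}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))}. \qquad (31)$$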

The idea behind equations 30 and 31 is to construct a Markov chain with a uniform stationary distribution on the state space $\mathcal{V}(P_m, h_q)$, thereby opening the door to an application of lemma 54. It seems the easiest construction to accomplish our task uses proposition 57. Recall the transformations of the form $\chi_{i, x, y}$ as in definitions 9 and 12 from definition 16. We now construct our Markov chain, call it $\mathcal{M}_q$, on the set $\mathcal{V}(P_m, h_q)$ where $q < k$ as follows: given a population of rollouts $Q_t \in \mathcal{V}(P_m, h_q)$ at time $t$, let $(i_q, x)$ be the state in the $q^{\text{th}}$ position of the first rollout of the population $Q_t$. Consider the set

$$\mathrm{States}_m(i_q \downarrow Q_t) = \{(j, z) \mid j \in i_q \downarrow (Q_t),\ z \in A \times m \text{ and the state } (j, z) \text{ appears in the population } Q_t \text{ following a state with equivalence class } i_q\} \cup \{(f, j) \mid 1 \le j \le m \text{ and } f \in i_q \downarrow_\Sigma P\}. \qquad (32)$$

Now select a state or a terminal label, call either one of these $v$, from the finite set $\mathrm{States}_m(i_q \downarrow Q_t)$ uniformly at random. Since each state appears uniquely in a population $Q_t$, by definition of the set $\mathrm{States}_m(i_q \downarrow Q_t)$ in 32, the state preceding the element $v$ selected from $\mathrm{States}_m(i_q \downarrow Q_t)$, call it $u$, is of the form $u = (i_q, y)$ where $y \in A \times m$. Now let $Q_{t+1} = \chi_{i_q, x, y}(Q_t)$. Notice that there are two mutually exclusive cases here:

Case 1: The states $u$ and $(i_q, x)$ appear in different rollouts (or, equivalently, the state $u$ does not appear in the first rollout since the state $(i_q, x)$ does by definition). In this case $Q_{t+1} \neq Q_t$ and the state in the first rollout of the population $Q_{t+1}$ in the $(q+1)^{\text{st}}$ position is $v$. In this case we will say that the element $v$ is mobile.

Case 2: The states $u$ and $(i_q, x)$ appear in the same rollout (of course, it has to be the first rollout). In this case $Q_{t+1} = Q_t$. We will say that the element $v$ is immobile.

Notice that in either of the cases, the population $Q_{t+1} \in \mathcal{V}(P_m, h_q)$ so that the Markov process is well defined on the set of populations $\mathcal{V}(P_m, h_q) \subseteq [P_m]$. We now emphasize the following simple important facts:

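Lemma 60. $\forall\, Q \in \mathcal{V}(P_m, h_q)$ we have $|\mathrm{States}_m(i_q \downarrow Q)| = m \cdot |\mathrm{States}_1(i_q \downarrow P)|$ and $|\mathrm{States}_1(i_q \downarrow P)| = \sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)$.

Proof. The fact that $|\mathrm{States}_1(i_q \downarrow P)| = \sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)$ follows directly from the definitions. Definition of the set $\mathrm{States}_m(i_q \downarrow Q)$ in 32 together with remark 34 tell us that $\mathrm{States}_m(i_q \downarrow Q) = \mathrm{States}_1(i_q \downarrow P_m)$ (where $P_m$ plays the role of $P$ for the time being) so that

$$|\mathrm{States}_m(i_q \downarrow Q)| = |\mathrm{States}_1(i_q \downarrow P_m)| = \sum_{j \in i_q \downarrow (P_m)} \mathrm{Order}(i_q \downarrow j, P_m) + i_q \downarrow_\Sigma (P_m) \overset{\text{by proposition 38}}{=} \sum_{j \in i_q \downarrow (P)} m \cdot \mathrm{Order}(i_q \downarrow j, P) + m \cdot i_q \downarrow_\Sigma (P) = m \cdot \Big( \sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P) \Big) = m \cdot |\mathrm{States}_1(i_q \downarrow P)|. \qquad \square$$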

Another very simple important observation is the following:

Lemma 61. Given any two populations $Q$ and $Q' \in \mathcal{V}(P_m, h_q)$, let $p_{Q \to Q'}$ denote the transition probability of the Markov chain $\mathcal{M}_q$ as constructed above. Then either $p_{Q \to Q'} = 0$ or $p_{Q \to Q'} = \frac{1}{m \cdot |\mathrm{States}_1(i_q \downarrow P)|}$. Moreover, $p_{Q \to Q'} = p_{Q' \to Q}$ and the uniform distribution is a stationary distribution of the Markov chain $\mathcal{M}_q$.

Proof. From the construction it is clear that if $p_{Q \to Q'} \neq 0$ then there must be an element $s \in \mathrm{States}_m(i_q \downarrow Q)$ which appears in a rollout in the population $Q$ different from the first one and it is the state at the $q^{\text{th}}$ position of the first rollout of the population $Q'$, while definition 16 tells us that the state $(i_q, x)$ in the $q^{\text{th}}$ position of the first rollout of the population $Q$ appears in $Q'$ in some rollout that is not the first one (the former position of the state $s$ that is now in position $q$ of the first rollout of $Q'$) and it is also a member of the set $\mathrm{States}_m(i_q \downarrow Q')$ according to the way $\mathrm{States}_m(i_q \downarrow Q')$ is introduced in 32. According to lemma 60, $|\mathrm{States}_m(i_q \downarrow Q)| = |\mathrm{States}_m(i_q \downarrow Q')| = m \cdot |\mathrm{States}_1(i_q \downarrow P)|$ so that the desired conclusion that $p_{Q \to Q'} = p_{Q' \to Q}$ follows from the construction of the Markov chain $\mathcal{M}_q$. The uniform probability distribution is a stationary distribution of the Markov chain $\mathcal{M}_q$ since we have just shown that the Markov transition matrix is symmetric (see also proposition 57). □
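The last step of the proof (a symmetric transition matrix has the uniform distribution as a stationary distribution) can be illustrated by a small sketch. The 4-state chain below is entirely hypothetical: its off-diagonal entries are either 0 or a common value `c`, mimicking the "either 0 or $\frac{1}{m \cdot |\mathrm{States}_1(i_q \downarrow P)|}$" dichotomy of lemma 61, and the diagonal absorbs the remaining mass.

```python
from fractions import Fraction


def step(dist, matrix):
    """One step of the chain on distributions: new[y] = sum_x dist[x] * p(x -> y)."""
    n = len(dist)
    return [sum(dist[x] * matrix[x][y] for x in range(n)) for y in range(n)]


# Hypothetical symmetric 0/1 pattern saying which distinct states communicate.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
c = Fraction(1, 5)  # common nonzero off-diagonal probability (illustrative)

n = len(adjacency)
matrix = [[c * adjacency[x][y] for y in range(n)] for x in range(n)]
for x in range(n):
    # The diagonal entry collects the probability of staying put.
    matrix[x][x] = 1 - sum(matrix[x])

uniform = [Fraction(1, n)] * n
after_one_step = step(uniform, matrix)
```

Because the matrix is symmetric, its columns sum to 1 just like its rows, so `after_one_step` coincides with `uniform`.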



Recall the generalized transition probabilities introduced in definition 50. For the remaining part of this section it is convenient to introduce the following definition:

Definition 62. Given a population $Q \in \mathcal{V}(P_m, h_{q+1})$, let $\mathrm{Mobile}_q(Q)$ denote the number of mobile elements (see case 1 above) in the set $\mathrm{States}_m(i_q \downarrow Q)$ that move the population $Q$ away from the set $\mathcal{V}(P_m, h_{q+1})$ (and hence, into the set $\mathcal{V}(P_m, h_q \setminus h_{q+1})$) under the application of the Markov chain $\mathcal{M}_q$ as constructed above. Dually, given $Q \in \mathcal{V}(P_m, h_q \setminus h_{q+1})$, let $\mathrm{Mobile}_q(Q)$ denote the number of mobile elements in the set $\mathrm{States}_m(i_q \downarrow Q)$ that move the population $Q$ away from the set $\mathcal{V}(P_m, h_q \setminus h_{q+1})$ (and hence, into the set $\mathcal{V}(P_m, h_{q+1})$).

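Suppose, for the time being, that the set $\mathcal{V}(P_m, h_{q+1}) \neq \emptyset$. Given a population $Q \in \mathcal{V}(P_m, h_{q+1})$, notice that

$$\mathrm{Mobile}_q(Q) \le \begin{cases} \sum_{j \in i_q \downarrow Q \text{ and } j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j)(Q) + i_q \downarrow_\Sigma (Q) & \text{if } q < k - 1 \\ \sum_{j \in i_q \downarrow Q} \mathrm{Order}(i_q \downarrow j)(Q) + i_q \downarrow_\Sigma (Q) - m & \text{if } q = k - 1 \end{cases} \qquad (33)$$

Notice that in case the population $P$ is homologous (and hence so are $P_m$ and $Q$) there are no immobile elements in the population $Q$ so that the inequality 33 turns into an exact equation. In general, from case 2 above it is clear that the total number of all the immobile elements is crudely bounded above by the height of the first rollout in the population $Q$, $H_1(Q)$. We now obtain a lower bound on the total number of mobile elements in the set $\mathrm{States}_m(i_q \downarrow Q)$ that move the population $Q$ away from the set $\mathcal{V}(P_m, h_{q+1})$ into the set $\mathcal{V}(P_m, h_q \setminus h_{q+1})$: this number is at least

$$\mathrm{Mobile}_q(Q) \ge \begin{cases} \sum_{j \in i_q \downarrow Q \text{ and } j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j)(Q) + i_q \downarrow_\Sigma (Q) - H_1(Q) & \text{if } q < k - 1 \\ \sum_{j \in i_q \downarrow Q} \mathrm{Order}(i_q \downarrow j)(Q) + i_q \downarrow_\Sigma (Q) - m - H_1(Q) & \text{if } q = k - 1 \end{cases} \qquad (34)$$

Analogously, if the population $Q \in \mathcal{V}(P_m, h_q \setminus h_{q+1})$ then the total number of mobile elements in the set $\mathrm{States}_m(i_q \downarrow Q)$ that move the population $Q$ away from the set $\mathcal{V}(P_m, h_q \setminus h_{q+1})$ (and hence, into the set $\mathcal{V}(P_m, h_{q+1})$) satisfies

$$\mathrm{Mobile}_q(Q) \le \begin{cases} \mathrm{Order}(i_q \downarrow i_{q+1})(Q) & \text{if } q < k - 1 \\ m & \text{if } q = k - 1 \end{cases} \qquad (35)$$

and, as before, the inequality turns into an exact equation in the case when $Q$ is a homologous population. At the same time

$$\mathrm{Mobile}_q(Q) \ge \begin{cases} \mathrm{Order}(i_q \downarrow i_{q+1})(Q) - H_1(Q) & \text{if } q < k - 1 \\ m - H_1(Q) & \text{if } q = k - 1 \end{cases} \qquad (36)$$

In view of proposition 38 and remark 34, inequalities 33, 34, 35 and 36 can be rewritten verbatim replacing $\mathrm{Order}(i_q \downarrow i_{q+1})(Q)$ with $m \cdot \mathrm{Order}(i_q \downarrow i_{q+1})(P)$ and $\mathrm{Order}(i_q \downarrow j)(Q)$ with $m \cdot \mathrm{Order}(i_q \downarrow j)(P)$.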

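For the case of homologous population $Q$ the situation is particularly simple:

Lemma 63. Suppose the population $P$ is homologous. Suppose further, that neither one of the sets $\mathcal{V}(P_m, h_{q+1})$ and $\mathcal{V}(P_m, h_q \setminus h_{q+1})$ is empty. Then $\forall\, m \in \mathbb{N}$ we have

$$p_{\mathcal{V}(P_m, h_{q+1}) \to \mathcal{V}(P_m, h_q \setminus h_{q+1})} = \begin{cases} \dfrac{\sum_{j \in i_q \downarrow (P) \text{ and } j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j)(P) + i_q \downarrow_\Sigma (P)}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j)(P) + i_q \downarrow_\Sigma (P)} & \text{if } q < k - 1 \\[2ex] \dfrac{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j)(P) + i_q \downarrow_\Sigma (P) - 1}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j)(P) + i_q \downarrow_\Sigma (P)} & \text{if } q = k - 1 \end{cases}$$

and

$$p_{\mathcal{V}(P_m, h_q \setminus h_{q+1}) \to \mathcal{V}(P_m, h_{q+1})} = \begin{cases} \dfrac{\mathrm{Order}(i_q \downarrow i_{q+1})(P)}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j)(P) + i_q \downarrow_\Sigma (P)} & \text{if } q < k - 1 \\[2ex] \dfrac{1}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j)(P) + i_q \downarrow_\Sigma (P)} & \text{if } q = k - 1 \end{cases}$$

Consequently, $\forall\, m \in \mathbb{N}$

$$\frac{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1}))}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))} = \begin{cases} \dfrac{\mathrm{Order}(i_q \downarrow i_{q+1})(P)}{\sum_{j \in i_q \downarrow (P) \text{ and } j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j)(P) + i_q \downarrow_\Sigma (P)} & \text{if } q < k - 1 \\[2ex] \dfrac{1}{\sum_{j \in i_q \downarrow (P)} \mathrm{Order}(i_q \downarrow j)(P) + i_q \downarrow_\Sigma (P) - 1} & \text{if } q = k - 1 \end{cases}$$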

Proof. The first and the second conclusions follow from equations [33] and [35] combined 
with lemma [nU definition [50l and comment following equation [36] The last conclusion is 
an immediate application of equation 18 to the lumping quotient of the Markov chain $\mathcal{M}_q$ into the two states $A = \mathcal{V}(P_m, h_{q+1})$ and $B = \mathcal{V}(P_m, h_q \setminus h_{q+1})$. □
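The lumping step can be made concrete with a two-state chain. The sketch below is illustrative only: the counts are hypothetical, and `lumped_stationary` is a helper name introduced here, not something defined in the paper. It uses the standard fact that a two-state chain with transition probabilities $p_{A \to B}$ and $p_{B \to A}$ has stationary weights proportional to $p_{B \to A}$ and $p_{A \to B}$.

```python
from fractions import Fraction


def lumped_stationary(p_a_to_b, p_b_to_a):
    """Stationary distribution (pi_A, pi_B) of a two-state chain,
    assuming p_a_to_b + p_b_to_a > 0."""
    total = p_a_to_b + p_b_to_a
    return p_b_to_a / total, p_a_to_b / total


# Hypothetical counts for a homologous population, case q < k - 1.
order_next = 3        # plays the role of Order(i_q -> i_{q+1})(P)
other_orders = 4      # sum of Order(i_q -> j)(P) over j != i_{q+1}
terminal_labels = 2   # plays the role of the terminal-label count
states_1 = order_next + other_orders + terminal_labels

# A = V(P_m, h_{q+1}),  B = V(P_m, h_q \ h_{q+1})
p_a_to_b = Fraction(other_orders + terminal_labels, states_1)
p_b_to_a = Fraction(order_next, states_1)

pi_a, pi_b = lumped_stationary(p_a_to_b, p_b_to_a)
ratio = pi_a / pi_b
```

Here `ratio` equals `order_next / (other_orders + terminal_labels)`, i.e. 1/2, which is the shape of the last conclusion of lemma 63 for $q < k - 1$.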

All that remains to do now to establish theorem 40 in the special case of homologous population $P$ is to show that whenever $1 \le q < k - 1$ and none of the "trivial extremes" takes place (see the beginning of this subsection), the sets $\mathcal{V}(P_m, h_{q+1})$ and $\mathcal{V}(P_m, h_q \setminus h_{q+1})$ are nonempty. This will be done later jointly with the corresponding fact needed for the general case. Meanwhile, we return to the estimation of the ratios of the form $\frac{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1}))}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))}$ in the general case. Suppose, for now, the following statement is true:

$$\forall\, q \text{ with } 1 \le q < k \ \exists\, \mathrm{const}(q) \in (0, 1) \text{ such that } \forall \text{ sufficiently large } m \text{ we have } \rho_m(\mathcal{V}(P_m, h_{q+1})) > \mathrm{const}(q) \text{ and } \rho_m(\mathcal{V}(P_m, h_q \setminus h_{q+1})) > \mathrm{const}(q) \qquad (37)$$

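In the general case of non-homologous population $P$ the presence of immobile states significantly complicates the situation. This is where the Markov inequality comes to the rescue, telling us that as $m$ increases the height of the first rollout (and hence the number of immobile states) being large becomes a more and more rare event so that the bounds in the inequalities 33 and 34 as well as inequalities 35 and 36 get closer and closer together. We now proceed in detail. Recall the construction of the sets $U_m^{\delta}$ starting with equation 24 and ending with inequality 27. Let $\delta > 0$ be given. According to inequality 27 $\exists\, M_1$ large enough so that $\forall\, m > M_1$ we have $\rho_m\big(U_m^{\delta \cdot \mathrm{const}(q+1)}\big) < \delta \cdot \mathrm{const}(q+1)$, where $\mathrm{const}(q+1)$ is as in the assumption statement 37. We now have

$$\frac{\pi_{q,m}\big(\mathcal{V}(P_m, h_{q+1}) \cap U_m^{\delta \cdot \mathrm{const}(q+1)}\big)}{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1}))} = \frac{\big|\mathcal{V}(P_m, h_{q+1}) \cap U_m^{\delta \cdot \mathrm{const}(q+1)}\big|}{|\mathcal{V}(P_m, h_{q+1})|} \le \frac{\big|U_m^{\delta \cdot \mathrm{const}(q+1)}\big|}{|\mathcal{V}(P_m, h_{q+1})|} = \frac{\big|U_m^{\delta \cdot \mathrm{const}(q+1)}\big| \,/\, |[P_m]|}{|\mathcal{V}(P_m, h_{q+1})| \,/\, |[P_m]|} = \frac{\rho_m\big(U_m^{\delta \cdot \mathrm{const}(q+1)}\big)}{\rho_m(\mathcal{V}(P_m, h_{q+1}))} < \frac{\delta \cdot \mathrm{const}(q+1)}{\mathrm{const}(q+1)} = \delta. \qquad (38)$$

Analogously,

$$\frac{\pi_{q,m}\big(\mathcal{V}(P_m, h_q \setminus h_{q+1}) \cap U_m^{\delta \cdot \mathrm{const}(q+1)}\big)}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))} \le \frac{\big|U_m^{\delta \cdot \mathrm{const}(q+1)}\big|}{|\mathcal{V}(P_m, h_q \setminus h_{q+1})|} = \frac{\rho_m\big(U_m^{\delta \cdot \mathrm{const}(q+1)}\big)}{\rho_m(\mathcal{V}(P_m, h_q \setminus h_{q+1}))} < \delta. \qquad (39)$$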

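Now observe that as long as a population $Q \in \mathcal{V}(P_m, h_{q+1}) \setminus U_m^{\delta \cdot \mathrm{const}(q+1)}$, the height of the first rollout $H_1(Q) < (\delta \cdot \mathrm{const}(q+1)) \cdot m < \delta \cdot m$ (recall how the sets of the form $U_m^{\delta}$ are introduced from 26). Now, for $q < k - 1$ inequalities 33, 34 and lemma 60 tell us that $\forall\, m > M_1$ we have

$$\frac{m \cdot \Big( \sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P) \Big) - \delta \cdot m}{m \cdot |\mathrm{States}_1(i_q \downarrow P)|} \le p_{Q \to \mathcal{V}(P_m, h_q \setminus h_{q+1})} \le \frac{m \cdot \Big( \sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P) \Big)}{m \cdot |\mathrm{States}_1(i_q \downarrow P)|}$$

so that dividing the numerator and the denominator by $m$ gives

$$\frac{\sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P) - \delta}{|\mathrm{States}_1(i_q \downarrow P)|} \le p_{Q \to \mathcal{V}(P_m, h_q \setminus h_{q+1})} \le \frac{\sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}{|\mathrm{States}_1(i_q \downarrow P)|}. \qquad (40)$$

Entirely analogous and, by now, well familiar to the reader reasoning with inequality 39 playing the role of inequality 38 shows that whenever $m > M_1$ and a population $Q \in \mathcal{V}(P_m, h_q \setminus h_{q+1}) \setminus U_m^{\delta \cdot \mathrm{const}(q+1)}$ we have

$$\frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P) - \delta}{|\mathrm{States}_1(i_q \downarrow P)|} \le p_{Q \to \mathcal{V}(P_m, h_{q+1})} \le \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{|\mathrm{States}_1(i_q \downarrow P)|}. \qquad (41)$$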

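Now inequalities 38, 39, 40 and 41 allow us to apply lemma 54 with $A = \mathcal{V}(P_m, h_{q+1})$, $B = \mathcal{V}(P_m, h_q \setminus h_{q+1})$ and $U = U_m^{\delta \cdot \mathrm{const}(q+1)}$, concluding that $\forall\, m > M_1$ we have

$$\frac{(1 - \delta) \cdot \Big( \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{|\mathrm{States}_1(i_q \downarrow P)|} - \delta \Big)}{(1 - \delta) \cdot \frac{\sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}{|\mathrm{States}_1(i_q \downarrow P)|} + \delta} \le \frac{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1}))}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))} \le \frac{(1 - \delta) \cdot \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{|\mathrm{States}_1(i_q \downarrow P)|} + \delta}{(1 - \delta) \cdot \frac{\sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}{|\mathrm{States}_1(i_q \downarrow P)|} - \delta}.$$

Multiplying the numerator and the denominator of the leftmost and the rightmost fractions by the constant $|\mathrm{States}_1(i_q \downarrow P)|$, which does not depend on $m$, we obtain

$$\frac{(1 - \delta) \cdot \big( \mathrm{Order}(i_q \downarrow i_{q+1}, P) - \delta \cdot |\mathrm{States}_1(i_q \downarrow P)| \big)}{(1 - \delta) \cdot \Big( \sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P) \Big) + \delta \cdot |\mathrm{States}_1(i_q \downarrow P)|} \le \frac{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1}))}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))} \le \frac{(1 - \delta) \cdot \mathrm{Order}(i_q \downarrow i_{q+1}, P) + \delta \cdot |\mathrm{States}_1(i_q \downarrow P)|}{(1 - \delta) \cdot \Big( \sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P) \Big) - \delta \cdot |\mathrm{States}_1(i_q \downarrow P)|}. \qquad (42)$$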

Now simply observe that the leftmost and the rightmost sides of the inequality 42 are both differentiable (and, hence, continuous) functions of $\delta$ on the domain $(-0.5, 0.5)$ (notice that the denominators do not vanish on this domain thanks to the assumption that neither of the trivial extremes takes place). It follows immediately then that both the leftmost and the rightmost sides of the inequality 42 converge to the same value, namely to the desired ratio

$$R = \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}$$

as $\delta \to 0$. From the definition of a limit of a real-valued function at a point, it follows that given any $\epsilon > 0$ we can choose small enough $\delta > 0$ such that both the leftmost and the rightmost sides of the inequality 42 are within $\epsilon$ error of $R$. We have now shown that depending on this $\delta$ we can then choose sufficiently large $M$ so that the ratio $\frac{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1}))}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))}$, being squeezed between the two quantities within the $\epsilon$ error of $R$, is itself within the error at most $\epsilon$ of $R$. In summary, we have finally proved the following

Lemma 64. Assume that the statement in 37 is true. Then whenever $1 \le q < k - 1$ we have

$$\lim_{m \to \infty} \frac{\pi_{q,m}(\mathcal{V}(P_m, h_{q+1}))}{\pi_{q,m}(\mathcal{V}(P_m, h_q \setminus h_{q+1}))} = \frac{\mathrm{Order}(i_q \downarrow i_{q+1}, P)}{\sum_{j \in i_q \downarrow (P),\ j \neq i_{q+1}} \mathrm{Order}(i_q \downarrow j, P) + i_q \downarrow_\Sigma (P)}.$$

An entirely analogous argument shows the following:

Lemma 65. Assume that the statement in 37 is true. Then

$$\lim_{m \to \infty} \frac{\pi_{k-1,m}(\mathcal{V}(P_m, h_k))}{\pi_{k-1,m}(\mathcal{V}(P_m, h_{k-1} \setminus h_k))} = \frac{1}{\sum_{j \in i_{k-1} \downarrow (P)} \mathrm{Order}(i_{k-1} \downarrow j, P) + i_{k-1} \downarrow_\Sigma (P) - 1}.$$

According to lemmas 58 and 59, equations 30 and 31, lemmas 64, 65, 63 and equations 22 and 23, all that remains to be proven to establish theorem 40 is the following:

Lemma 66. Suppose neither of the trivial extremes takes place. Then the statement in equation 37 is true. Furthermore, in case of homologous recombination the statement is true for all $m$ (not only for large enough $m$).

Proof. We proceed by induction on the index q. First of all, recall from the beginning of 
the current subsection that we have already shown that V m G N we have 

, NN byiemmalU] , . Order(a I ii , P) 

Pm[V[Pm, hi)) = hm ^{hi, Prn, t) = > 

where the last inequality holds because none of the trivial extremes takes place so that 
Order(a J, ii, P) ^ (recall that p„i denotes the uniform probability distribution on [Pm] 



so that I 



p™(V(P,„, hi)) - "^^-f^^l ). Since V(P,„, /u) = V(P™, /i2)WV(P,„, h^Xh^) 

we also have p„(V(P™, /^2))+Pm(V(P„„ hi\h2)) = /9™(V(P™, hi)) = ' = 

consto where 1 > consto > and consto is independent of m. It follows then that at least 
one of the following is true: p,„(V(P,„, /is)) > or p„,(V(P,„, hi \ /^s)) > 

In the general case, choose Mi large enough so that \f m > Mi we have Pm{Um " ) < 
(recall the part of the proof starting with equation|24]and ending with inequalitvl27]l. 
It follows then that either 

la\ consto ( consto \ consto 



Pm V{Pn, h2)\Um' > Or Pm V(P„, hi\h2)\Um' > 



4 



An akeady familiar argument exploiting corollary |53] inequalities [33] |34] |35] [36] and 
lemma [60] shows that, thanks to the assumption that no trivial extremes take place, and 
observing that 1 — > i for all large enough m the ratios 

^1 

V(P„,/l2)\(7„ |^V(P,„,/ll\h2) 



p 



Pv(P,^,hi\h2)^V{P^,h2) 



and, likewise. 



( v(p,„./ii\/i2)\;7„7^ )^v(p„ 



^V(Pm, 'i2)->V(P„, /ll\/l2) 



> K2 



where both, ki and K2 > and independent of m. Now we apply lemma [55] to 
the sets B ~ V{Pm, h2) \ Um " and A ~ V(P,„, hi \ /12) in the case when 



Pm (^V(P„, /i2)\C/™ ^ j > ^i^ortothepairofsetsP = V(P™, hi\h2)\Um * 

and A = V(P„, /i2) in the case when p,„ (^V(P„, /ii \ /12) \ C/,!f^^ > ^^22^*0, tells 
us that if we let const{l) = minj^^S^, ^^SZHlo . . ^j^^j^ ^j^^ statement in 

37 is true for $q = 1$. This establishes the base case of induction. Now observe that if the statement in 37 holds for some $q$ then it is true, in particular, that $\exists$ a constant $\mathrm{const}(q)$ independent of $m$ such that for all large enough $m$ we have $\rho_m(\mathcal{V}(P_m, h_q)) > \mathrm{const}(q)$. Now the validity of the statement in 37 for $q + 1$ follows from an entirely analogous argument to the one in the base case of induction with $\mathrm{const}(q)$ playing the role of $\mathrm{const}_0$ and the Markov chain $\mathcal{M}_q$ replacing the Markov chain $\mathcal{M}_1$. In the case of homologous recombination, an even simpler (since there is no need to worry about the height of the first rollout), analogous argument shows that the statement in 37 holds $\forall\, m$. □

6. A Further Strengthening of the General Finite Population Geiringer 
Theorem for Evolutionary Algorithms 

6.1. A Form of the Classical Contraction Mapping Principle for a Family of 
Maps having the same Fixed Point 

The material of this section requires familiarity with elementary point set topology or with basic theory of metric spaces (see, for instance, [21]). Throughout this section $(X, d)$ denotes a complete metric space. We recall the following from classical theory of metric spaces:

Definition 67. We say that a map $f : X \to X$ is a contraction on $X$ if $\exists\, k < 1$ such that $\forall\, x, y \in X$ we have $d(f(x), f(y)) \le k \cdot d(x, y)$. We also call $k$ a contraction rate.* We may then say that $f$ is a contraction with contraction rate at most $k$.

The classical result known as contraction mapping principle states the following: 

Theorem 68 (Contraction Mapping Principle). Suppose $(X, d)$ is a complete metric space and $f : X \to X$ is a contraction on $X$ in the sense of definition 67. Then $\exists!\, z \in X$ such that $\forall\, y \in X$ we have $\lim_{n \to \infty} f^n(y) = z$.

Proof. The proof can be found in nearly every textbook on point set topology such as [21], for instance. □

In our application we will exploit the following natural extension of definition 67:

Definition 69. Suppose $(X, d)$ is a complete metric space. We say that a family of maps $\mathcal{F} \subseteq \{f \mid f : X \to X\}$ is an equi-contraction family if $\exists\, k < 1$ such that $\forall\, f \in \mathcal{F}$ and $\forall\, x, y \in X$ we have $d(f(x), f(y)) \le k \cdot d(x, y)$.

Evidently, if the family $\mathcal{F}$ of contractions is finite, one can take the maximum of a finite set $K = \{k_f \mid f \in \mathcal{F}\}$, where each $k_f < 1$ is a contraction rate of $f$ (i.e. $\forall\, x, y \in X$ we have $d(f(x), f(y)) \le k_f \cdot d(x, y)$), so that we immediately deduce the following important (for our application) corollary:

Corollary 70. If $\mathcal{F}$ is any finite family of contractions on the metric space $X$ then $\mathcal{F}$ is an equi-contraction family.

* Evidently the contraction rate is not unique with such a notion. Nonetheless, the minimal contraction rate does exist since it is $\inf\{k \mid k \text{ is a contraction rate}\}$.




The classical contraction mapping principle says that every contraction map on a complete metric space has a unique fixed point. Here we need a slight extension of theorem 68, which probably appears as an exercise in some point set topology or real analysis textbook, but for the sake of completeness it is included in our paper.

Theorem 71. Suppose we are given an equi-contraction family $\mathcal{F}$ on the complete metric space $(X, d)$. Suppose further that every $f \in \mathcal{F}$ has the same unique fixed point $z$ (in accordance with theorem 68). Consider any sequence of composed functions $g_1 = f_1$, $g_2 = f_2 \circ g_1$, $\ldots$, $g_n = f_n \circ g_{n-1}$ where each $f_i \in \mathcal{F}$ (it is allowed for $f_i = f_j$ when $i \neq j$). Then $\forall\, y \in X$ $\lim_{n \to \infty} g_n(y) = z$ exponentially fast for some constant $k < 1$. In particular, the convergence rate does not depend on the sequence $\{g_n\}_{n=1}^{\infty}$ (as long as it is constructed in the manner described above). Moreover, in case $d$ is a bounded metric (i.e. $\sup_{x, y \in X} d(x, y) < \infty$), the convergence rate does not depend even on the choice of the initial point $y \in X$.

Proof. Since all the functions $f_i$ have the same fixed point $z$, it is clear by induction that $\forall\, n$ we have $g_n(z) = z$. Since $\mathcal{F}$ is an equi-contraction family, in accordance with definition 69 $\exists\, k < 1$ such that $\forall\, f \in \mathcal{F}$ we have $d(f(x), f(y)) \le k \cdot d(x, y)$. We now have $d(g_1(y), z) = d(f_1(y), f_1(z)) \le k \cdot d(y, z)$. If $d(g_m(y), z) \le k^m \cdot d(y, z)$, then $d(g_{m+1}(y), z) = d(f_{m+1}(g_m(y)), f_{m+1}(z)) \le k \cdot d(g_m(y), z) \le k \cdot (k^m \cdot d(y, z)) = k^{m+1} \cdot d(y, z)$ so that by induction it follows that $\forall\, n \in \mathbb{N}$ we have $d(g_n(y), z) \le k^n \cdot d(y, z)$. But $k < 1$ so that $d(g_n(y), z) \to 0$ exponentially fast as $n \to \infty$, which is another way of stating the first desired conclusion. If $\sup_{x, y \in X} d(x, y) < \infty$ then $d(g_n(y), z) \le k^n \cdot d(y, z) \le k^n \cdot \sup_{x, y \in X} d(x, y)$. □
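The bound $d(g_n(y), z) \le k^n \cdot d(y, z)$ can be observed numerically. The following is a minimal sketch on the real line with the usual distance; the family of affine maps, the fixed point `z`, the starting point `y` and the random seed are all illustrative assumptions, not objects from the paper.

```python
import random


def make_contraction(rate, fixed_point):
    """Affine map x -> fixed_point + rate * (x - fixed_point) on the real line.
    It fixes `fixed_point` and contracts distances by the factor |rate|."""
    return lambda x: fixed_point + rate * (x - fixed_point)


z = 2.0                                   # common fixed point (illustrative)
rates = (0.5, -0.3, 0.8)                  # individual contraction rates
family = [make_contraction(r, z) for r in rates]
k = max(abs(r) for r in rates)            # common rate of the finite family

rng = random.Random(0)                    # fixed seed for reproducibility
y = 10.0                                  # arbitrary starting point
x = y
distances = []
for n in range(1, 31):
    f_n = rng.choice(family)              # any map of the family at each step
    x = f_n(x)                            # x is now g_n(y)
    distances.append(abs(x - z))

# Check d(g_n(y), z) <= k**n * d(y, z), with a small floating point tolerance.
bound_holds = all(
    d <= k ** n * abs(y - z) + 1e-12
    for n, d in enumerate(distances, start=1)
)
```

The order in which the maps are drawn does not matter for the bound, which is exactly what the theorem asserts about the sequence $g_n$.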

6.2. What does Theorem 71 tell us about Markov Chains?

Suppose $\mathcal{M}$ is a Markov chain on a finite state space $X$ with transition matrix $P = \{p_{x \to y}\}_{x, y \in X}$. Clearly $P$ extends to the linear map on the free vector space $\mathbb{R}^X$ spanned by the point mass probability distributions which form an orthonormal basis of this vector space (isomorphic to $\mathbb{R}^{|X|}$, of course) under the $L_1$ norm defined as the sum of the absolute values of the coordinates: $\| \sum_{x \in X} u_x x \|_{L_1} = \sum_{x \in X} |u_x|$. The linear endomorphism $P$ defined by the matrix $\{p_{x \to y}\}_{x, y \in X}$ with respect to the basis $X$ restricts to the probability simplex

$$\Delta_X = \Big\{ \sum_{x \in X} u_x x \ \Big|\ \forall\, x \in X \text{ we have } u_x \ge 0 \text{ and } \sum_{x \in X} u_x = 1 \Big\} \qquad (43)$$

(which is closed and bounded in $\mathbb{R}^X$ and hence is compact, which is way stronger than we need). The following well-known fact from basic Markov chain theory allows us to apply the tools from subsection 6.1. For the sake of completeness a proof is included.

Theorem 72. Suppose $\mathcal{M}$ with notation as above is an irreducible Markov chain (meaning that $\forall\, x, y \in X$ we have $p_{x \to y} > 0$). Then $P = \{p_{x \to y}\}_{x, y \in X} : \Delta_X \to \Delta_X$ (see equation 43) is a contraction (see definition 67) on the complete and bounded probability simplex $\Delta_X$ with respect to the metric induced by the $L_1$ norm,* i.e. $\|u\|_{L_1} = \sum_{x \in X} |u_x|$ where $u = \sum_{x \in X} u_x x$. Moreover, the contraction rate (see definition 67) is at most $1 - |X| \epsilon$ where $\epsilon > 0$ is any number smaller than $\min_{x, y \in X} p_{x \to y}$.

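Proof. First notice that given any Markov transition matrix $R = \{r_{x \to y}\}_{x, y \in X}$, and any two probability distributions $\pi$ and $\sigma \in \Delta_X$, we have

$$\|R(\pi) - R(\sigma)\|_{L_1} = \sum_{y \in X} \Big| \sum_{x \in X} r_{x \to y} (\pi(x) - \sigma(x)) \Big| \le \sum_{y \in X} \sum_{x \in X} r_{x \to y} |\pi(x) - \sigma(x)| = \sum_{x \in X} \sum_{y \in X} r_{x \to y} |\pi(x) - \sigma(x)| = \sum_{x \in X} |\pi(x) - \sigma(x)| = \|\pi - \sigma\|_{L_1}.$$

In summary, we have shown that

$$\forall \text{ Markov transition matrix } R = \{r_{x \to y}\}_{x, y \in X} \text{ on the state space } X \text{ and } \forall \text{ probability distributions } \pi, \sigma \in \Delta_X \text{ we have } \|R(\pi - \sigma)\|_{L_1} = \|R(\pi) - R(\sigma)\|_{L_1} \le \|\pi - \sigma\|_{L_1}. \qquad (44)$$

There is one more simple fact we observe: let $J$ denote an $X \times X$ matrix with all entries equal to 1. Given any vector $u = \sum_{x \in X} u_x x$, we have $J \cdot u = v = \sum_{x \in X} v_x x$ where $\forall\, y \in X$ we have $v_y = \sum_{x \in X} u_x$ independently of $y$. It is clear then that the kernel of the matrix $J$ is

$$\mathrm{Ker}(J) = \Big\{ u \ \Big|\ u = \sum_{x \in X} u_x x \text{ and } \sum_{x \in X} u_x = 0 \Big\}.$$

In particular, if $\pi$ and $\sigma$ are probability distributions on $X$, then the sums of coordinates $\sum_{x \in X} \pi(x) = \sum_{x \in X} \sigma(x) = 1$ so that the vector $\pi - \sigma \in \mathrm{Ker}(J)$, i.e. $J(\pi - \sigma) = 0$. In summary, we deduce the following:

$$\forall \text{ probability distributions } \pi \text{ and } \sigma \in \Delta_X \text{ we have } J(\pi - \sigma) = 0. \qquad (45)$$

The assumption that $p_{x \to y} > 0$ together with the assumption that $X$ is a finite set imply that we can find a positive number $\epsilon > 0$ such that $0 < \epsilon < \min\{p_{x \to y} \mid x, y \in X\}$. Let $N = |X|$ denote the size of the state space $X$ and notice that by the choice of $\epsilon$ in the previous sentence, $\forall\, x \in X$ we have $N \cdot \epsilon < \sum_{y \in X} p_{x \to y} = 1$ so that $\alpha = 1 - N \epsilon > 0$. We can now write

$$P = (P - \epsilon J) + \epsilon J = \alpha \Big( \frac{1}{\alpha} (P - \epsilon J) \Big) + \epsilon J = \alpha Q + \epsilon J \qquad (46)$$

* Of course, the total variation norm, which is a constant scaling of the $L_1$ norm by a factor of $\frac{1}{2}$, can be used in place of the $L_1$ norm alternatively.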




where $Q = \frac{1}{\alpha}(P - \epsilon J) = \{q_{x \to y}\}_{x, y \in X}$ is a stochastic matrix, i.e. $\forall\, x \in X$ the sum of the entries

$$\sum_{y \in X} q_{x \to y} = \frac{1}{\alpha} \sum_{y \in X} (p_{x \to y} - \epsilon) = \frac{1 - N \epsilon}{\alpha} = 1,$$

so that $Q$ is a matrix representing a Markov chain on the state space $X$. Now, given any two distributions $\pi$ and $\sigma \in \Delta_X$, using the decomposition of the matrix $P$ given in equation 46 together with the facts expressed in equation 45 we obtain

$$P(\pi - \sigma) = (\alpha Q + \epsilon J)(\pi - \sigma) = \alpha Q(\pi - \sigma) + \epsilon J(\pi - \sigma) = \alpha Q(\pi - \sigma)$$

so that, since $Q$ is a matrix which represents a Markov chain, the fact expressed in equation 44 readily gives us the desired conclusion that

$$\|P(\pi - \sigma)\|_{L_1} = \|\alpha Q(\pi - \sigma)\|_{L_1} = \alpha \|Q(\pi - \sigma)\|_{L_1} \le \alpha \|\pi - \sigma\|_{L_1}$$

which shows that $P$ is a contraction since we demonstrated before that $0 < \alpha < 1$. □
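The contraction estimate can be checked numerically on a small example. The sketch below is illustrative only: the $3 \times 3$ matrix `P`, the value of `eps` and the random test distributions are assumptions made here, chosen so that `eps` is strictly below the smallest entry of `P` as the theorem requires.

```python
import random


def push_forward(dist, matrix):
    """Action of a transition matrix on a distribution:
    new[y] = sum_x dist[x] * p(x -> y)."""
    n = len(dist)
    return [sum(dist[x] * matrix[x][y] for x in range(n)) for y in range(n)]


def l1_distance(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def random_distribution(n, rng):
    """A random point of the probability simplex on n states."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical transition matrix with strictly positive entries (rows sum to 1).
P = [
    [0.50, 0.30, 0.20],
    [0.10, 0.60, 0.30],
    [0.25, 0.25, 0.50],
]
eps = 0.09                      # strictly smaller than the minimal entry 0.10
alpha = 1 - len(P) * eps        # contraction rate bound 1 - |X| * eps

rng = random.Random(1)
contraction_holds = True
for _ in range(1000):
    pi = random_distribution(3, rng)
    sigma = random_distribution(3, rng)
    lhs = l1_distance(push_forward(pi, P), push_forward(sigma, P))
    rhs = alpha * l1_distance(pi, sigma)
    contraction_holds = contraction_holds and lhs <= rhs + 1e-12
```

A random search of this kind cannot prove the inequality, but a single failing pair would contradict it, so the check is a useful smoke test when experimenting with other matrices.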

In corollary 70 we saw that any finite family of contraction maps is an equi-contraction family. For Markov transition matrices (also called stochastic matrices in the literature) significantly more is true. The following notion is naturally motivated by definition 69 and theorem 72.

Definition 73. Given a family of Markov transition matrices

$$\mathcal{F} = \Big\{ \{p^i_{x \to y}\}_{x, y \in X} \ \Big|\ i \in I,\ \pi \in \Delta_X \text{ and } \forall\, i \in I \text{ and } \forall\, y \in X \text{ we have } \sum_{x \in X} p^i_{x \to y} \pi_x = \pi_y \text{ and } \beta = \inf_{i \in I,\ x, y \in X} p^i_{x \to y} > 0 \Big\}$$

indexed by some set $I$, sharing a common stationary distribution $\pi$ and such that the greatest lower bound of all the entries from all the matrices in $\mathcal{F}$, let's call it $\beta$, is strictly positive (or, equivalently, is not 0), we say that $\mathcal{F}$ is a family of interchangeable Markov transition matrices with lower bound $\beta$.

Evidently, theorem 72 immediately implies the following

Corollary 74. Every interchangeable family $\mathcal{F}$ of Markov transition matrices with lower bound $\beta$ is an equi-contraction family with a common contraction rate at most $\alpha = 1 - |X| \epsilon$ for any $\epsilon$ with $0 < \epsilon < \beta$.

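Moreover, families of interchangeable Markov transition matrices can often be easily expanded as follows.

Corollary 75. Suppose that a family $\mathcal{F}$ of Markov transition matrices over the same state space $X$ is interchangeable with lower bound $\beta$. Then so is the convex hull of the family $\mathcal{F}$,

$$\Delta(\mathcal{F}) = \Big\{ T \ \Big|\ T = \sum_{i=1}^{k} t_i M_i \text{ where } k \in \mathbb{N} \text{ and } \forall\, 1 \le i \le k \text{ we have } M_i \in \mathcal{F},\ 0 \le t_i \le 1 \text{ and } \sum_{i=1}^{k} t_i = 1 \Big\}.$$

Proof. Given a matrix $T = \{t_{x \to y}\}_{x, y \in X} \in \Delta(\mathcal{F})$, we can write $T = \sum_{j=1}^{k} t_j M_j$ with $M_j = \{p^j_{x \to y}\}_{x, y \in X} \in \mathcal{F}$, $0 \le t_j \le 1$ and $\sum_{j=1}^{k} t_j = 1$. But then $\forall\, x, y \in X$ we have $t_{x \to y} = \sum_{j=1}^{k} t_j \cdot p^j_{x \to y} \ge \sum_{j=1}^{k} t_j \cdot \beta = \beta$. Moreover, by linearity $T \pi = \sum_{j=1}^{k} t_j M_j \pi = \sum_{j=1}^{k} t_j \pi = \pi$, so that $\pi$ remains a stationary distribution of $T$ and the desired conclusion follows at once. □

Combining theorem 72, corollary 70 and corollary 75 readily gives the following

Corollary 76. Suppose we are given a finite family $\mathcal{F}$ of Markov transition matrices such that all the entries of each matrix $M \in \mathcal{F}$ are strictly positive. Then $\Delta(\mathcal{F})$ is an equi-contraction family.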

Corollary 76 extends the applicability of the finite population Geiringer theorem appearing in [13] and in [12] (and, possibly, some other homogeneous-time Markov chain constructions) to non-homogeneous time Markov chains generated by arbitrary stochastic processes in the sense below.

Theorem 77. Consider any finite set $X$. Let $\mathcal{F}$ denote a finite family of Markov transition matrices on $X$ such that all the entries of each matrix $M \in \mathcal{F}$ are strictly positive and all the matrices in $\mathcal{F}$ have a common stationary distribution $\pi$. Now consider any stochastic process $\{Z_n\}_{n=0}^{\infty}$ with each $Z_n = (F_n, X_n)$ on $\mathcal{F} \times X$ having the following properties:

$F_0$ and $X_0$ are independent random variables. (47)

For $n \ge 1$ $F_n$ does not depend on $X_n, X_{n+1}, \ldots$ (however, it may depend on $X_0, X_1, \ldots, X_{n-1}$ as well as many other implicit parameters). (48)

The stochastic process $X_n$ is a non-homogeneous time Markov chain on $X$ with transition matrices $F_n(\omega)$. More explicitly, if $F_{n-1}(\omega) = \{p_{x \to y}\}_{x, y \in X}$ then $\forall\, y \in X$ we have

$$P(X_n = y) = \sum_{x \in X} P(X_{n-1} = x)\, p_{x \to y}. \qquad (49)$$

Then the non-homogeneous time Markov chain converges to the unique stationary distribution $\pi$ exponentially fast regardless of the initial distribution of $X_0$. More precisely, $\exists\, \alpha \in (0, 1)$ such that $\forall\, t \in \mathbb{N}$ we have

$$\|P(X_t = \cdot) - \pi\|_{L_1} < \alpha^t$$

where $P(X_t = \cdot)$ denotes the probability distribution of the random variable $X_t$.

Proof. Observe that if we want to compute the distribution of $X_1$ given the distribution of $X_0$, we need to select a Markov transition matrix $M = \{m_{x \to y}\}_{x, y \in X} \in \mathcal{F}$ with respect to the probability distribution of $F_0$ which is independent of $X_0$. The value of $X_1$ is then obtained by selecting a value $x$ of $X_0$ with respect to the initial distribution $P(X_0 = \cdot)$ and then obtaining the next state $X_1 = y$ with probability $m_{x \to y}$. Thereby $\forall\, y \in X$ we may write

$$P(X_1 = y) = \sum_{M \in \mathcal{F}} \sum_{x \in X} P(F_0 = M \text{ and } X_0 = x)\, m_{x \to y} \overset{\text{by independence}}{=} \sum_{x \in X} P(X_0 = x) \sum_{M \in \mathcal{F}} P(F_0 = M)\, m_{x \to y}. \qquad (50)$$

Since $\mathcal{F}$ is a finite set, $\forall\, M \in \mathcal{F}$ we have $P(F_0 = M) \in [0, 1]$ and $\sum_{M \in \mathcal{F}} P(F_0 = M) = 1$, we deduce that the matrix $T_0 = \sum_{M \in \mathcal{F}} P(F_0 = M) \cdot M \in \Delta(\mathcal{F})$ is a Markov transition matrix and equation 50 can be alternatively written in the vector form as

$$P(X_1 = \cdot) = T_0 \cdot P(X_0 = \cdot). \qquad (51)$$

Continuing inductively, if we assume

$$P(X_k = \cdot) = T_{k-1} \circ \ldots \circ T_1 \circ T_0 \cdot P(X_0 = \cdot) \qquad (52)$$

for $k \ge 1$ where the Markov transition matrices $T_i \in \Delta(\mathcal{F})$, then it follows analogously to the above reasoning that

$$P(X_{k+1} = y) = \sum_{M \in \mathcal{F}} \sum_{x \in X} P(F_k = M \text{ and } X_k = x)\, m_{x \to y} \overset{\text{by independence of } F_k \text{ and } X_k}{=} \sum_{x \in X} P(X_k = x) \sum_{M \in \mathcal{F}} P(F_k = M)\, m_{x \to y}$$

so that for the same reasons as before we may conclude that

$$P(X_{k+1} = \cdot) = T_k \cdot P(X_k = \cdot) = T_k \cdot (T_{k-1} \circ \ldots \circ T_1 \circ T_0 \cdot P(X_0 = \cdot)) = T_k \circ T_{k-1} \circ \ldots \circ T_1 \circ T_0 \cdot P(X_0 = \cdot) \qquad (53)$$

where $T_k = \sum_{M \in \mathcal{F}} P(F_k = M) \cdot M \in \Delta(\mathcal{F})$ for the same reason as $T_0 \in \Delta(\mathcal{F})$. We now conclude by induction that $\forall\, t \in \mathbb{N}$ we have

$$P(X_t = \cdot) = T_{t-1} \circ \ldots \circ T_1 \circ T_0 \cdot P(X_0 = \cdot) \qquad (54)$$

where $\forall\, i \in \mathbb{N} \cup \{0\}$ we have $T_i \in \Delta(\mathcal{F})$. According to corollary 76 the family of Markov transition matrices $\Delta(\mathcal{F})$ is an equi-contraction family with the same common stationary distribution $\pi$ and now the desired conclusion follows immediately from theorem 71. □
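The mechanism of the proof (each step applies some matrix $T_t$ from the convex hull, and all of them contract towards the same $\pi$) can be traced on a toy example. The sketch below rests on several assumptions made here for illustration: the two symmetric matrices `M1` and `M2` (symmetric, hence with the uniform stationary distribution), the deterministic schedule of mixing weights `w`, and the constant 2 in the checked bound, which is the $L_1$ diameter of the probability simplex, in the spirit of the bounded-metric clause of theorem 71.

```python
def push_forward(dist, matrix):
    """Action of a transition matrix on a distribution:
    new[y] = sum_x dist[x] * p(x -> y)."""
    n = len(dist)
    return [sum(dist[x] * matrix[x][y] for x in range(n)) for y in range(n)]


def l1_distance(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical family: two symmetric matrices with strictly positive entries.
M1 = [[0.6, 0.2, 0.2],
      [0.2, 0.6, 0.2],
      [0.2, 0.2, 0.6]]
M2 = [[0.2, 0.5, 0.3],
      [0.5, 0.2, 0.3],
      [0.3, 0.3, 0.4]]

n = 3
uniform = [1.0 / n] * n
eps = 0.19                      # strictly below the smallest entry 0.2
alpha = 1 - n * eps             # common contraction rate bound

dist = [1.0, 0.0, 0.0]          # arbitrary initial distribution of X_0
bound_holds = True
for t in range(1, 26):
    w = (t % 4) / 3.0           # illustrative time-dependent mixing weight in [0, 1]
    # T_t is a convex combination of M1 and M2, i.e. an element of the convex hull.
    T_t = [[w * M1[x][y] + (1 - w) * M2[x][y] for y in range(n)] for x in range(n)]
    dist = push_forward(dist, T_t)
    bound_holds = bound_holds and l1_distance(dist, uniform) <= 2 * alpha ** t + 1e-12
```

Changing the weight schedule (for example making it depend on earlier values of `dist`) does not affect the bound, since every `T_t` still lies in the convex hull of the family.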

Remark 78. It is interesting to notice that the non-homogenous time Markov process $X_n$
in theorem 77 may be generated by non-Markovian processes $F_n$, where the Markov
transition matrices $F_n$ depend not only on the past history $F_0, F_1, \ldots, F_{n-1}$ but also on the
history of the stochastic process $X_n$ itself. This property is interesting not only from the
mathematical point of view but also in regard to the main subject of the current paper: the
application to the Monte Carlo tree search method. Due to the past history in a certain
game as well as other possibly hidden circumstances (such as human mood, psychological
state etc.), a player may suspect the states of being interchangeable to a greater or lesser
degree. Theorems like theorem 77 demonstrate that in most cases this will not matter in the limiting
case, which strengthens the theoretical foundation in support of the main ideas presented in
this work.

One can extend theorem 77 further to be applicable to a wider class of families of Markov
transition matrices having a common stationary distribution than just those having all
positive entries.

Definition 79. We say that a family $\mathcal{F}$ of Markov transition matrices is irreducible and
aperiodic with a common stationary distribution $\pi$ if $\pi$ is a stationary distribution of every
matrix $M \in \mathcal{F}$ and $\exists\, k \in \mathbb{N}$ such that for every sequence of transformations $\{M_i\}_{i=1}^{k}$ with $M_i \in \mathcal{F}$
the composed Markov transition matrix $T = M_1 \circ M_2 \circ \ldots \circ M_k$ has strictly positive entries.
We also say that $k$ is the common reachable index.
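Definition 79 can be checked by brute force for a small family. The sketch below is an illustration under stated assumptions: the matrices `A` and `B` (a lazy cyclic shift on three states and its transpose) and the helper names are hypothetical and do not come from the paper. Both matrices are doubly stochastic, so the uniform distribution is a common stationary distribution, and each has zero entries, so the reachable index must exceed 1.

```python
from itertools import product

def compose(A, B):
    """Matrix product A o B (B is applied first, then A)."""
    n = len(A)
    return [[sum(A[i][j] * B[j][l] for j in range(n)) for l in range(n)]
            for i in range(n)]

def k_fold_products(family, k):
    """Yield every composition M_1 o M_2 o ... o M_k with each M_i in the family."""
    for seq in product(family, repeat=k):
        T = seq[0]
        for M in seq[1:]:
            T = compose(T, M)
        yield T

def common_reachable_index(family, k_max):
    """Smallest k <= k_max for which all k-fold products are strictly positive, else None."""
    for k in range(1, k_max + 1):
        if all(min(min(row) for row in T) > 0 for T in k_fold_products(family, k)):
            return k
    return None

# Hypothetical family: lazy cyclic shift and its transpose (both doubly stochastic).
A = [[0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]

k = common_reachable_index([A, B], k_max=4)
print(k)
```

The enumeration visits $|\mathcal{F}|^k$ sequences, so it is only practical for small families and small $k$.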

If we were to start with a finite irreducible and aperiodic family of Markov transition
matrices $\mathcal{F}$ with a common reachable index $k$ in the sense of definition 79, then the corresponding
family

$$\widetilde{\mathcal{F}} = \{L \mid L = M_1 \circ M_2 \circ \ldots \circ M_k \text{ with } M_i \in \mathcal{F}\} \tag{55}$$

has the size $|\widetilde{\mathcal{F}}| \leq |\mathcal{F}|^k < \infty$ and every matrix in the family $\widetilde{\mathcal{F}}$ has strictly positive entries.
It follows immediately from corollary 76 that $\Delta(\widetilde{\mathcal{F}})$ is an equi-contraction family. Now
suppose that we are dealing with the same stochastic process as described in the statement
of theorem 77 with the only exception that the family $\mathcal{F}$ is a finite irreducible and aperiodic
family with a common reachable index $k$ rather than "a finite family of Markov transition
matrices on $\mathcal{X}$ such that all the entries of each matrix $M \in \mathcal{F}$ are strictly positive".
Notice that the proof of theorem 77 does not use the assumption that the Markov transition
matrix entries are strictly positive up to the last step following equation 54. Therefore, it
follows that the same equation holds for a finite irreducible and aperiodic family of Markov
transition matrices, i.e.

$$\forall\, t \in \mathbb{N} \text{ we have } P(X_t = \cdot) = T_{t-1} \circ \ldots \circ T_1 \circ T_0 \cdot P(X_0 = \cdot) \tag{56}$$

where $\forall\, i \in \mathbb{N} \cup \{0\}$ we have $T_i \in \Delta(\mathcal{F})$. We now observe the following simple fact.

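Lemma 80. The family of linear transformations (and Markov transition matrices in
particular) $\widetilde{\Delta(\mathcal{F})}$ satisfies

$$\widetilde{\Delta(\mathcal{F})} \subseteq \Delta(\widetilde{\mathcal{F}})$$

where

$$\widetilde{\Delta(\mathcal{F})} = \{T \mid T = T_1 \circ T_2 \circ \ldots \circ T_k \text{ with } T_i \in \Delta(\mathcal{F})\} \tag{57}$$

and the family $\Delta(\widetilde{\mathcal{F}})$ is the convex hull of the family $\widetilde{\mathcal{F}}$ introduced in equation 55 in the
sense of the defining equation in corollary 75.

Proof. Given a transformation

$$T = T_1 \circ T_2 \circ \ldots \circ T_k \in \widetilde{\Delta(\mathcal{F})}, \tag{58}$$

since each $T_i \in \Delta(\mathcal{F})$, we have

$$\forall\, i \text{ with } 1 \leq i \leq k \text{ we have } T_i = \sum_{j=1}^{l(i)} \alpha^i_j M_j(i) \text{ with } M_j(i) \in \mathcal{F},\ 0 \leq \alpha^i_j \leq 1 \text{ and } \sum_{j=1}^{l(i)} \alpha^i_j = 1. \tag{59}$$

Plugging equation 59 into equation 58 and using the linearity of the $T_i$'s we obtain

$$T = \sum_{j(1)=1}^{l(1)} \sum_{j(2)=1}^{l(2)} \cdots \sum_{j(k)=1}^{l(k)} \left(\prod_{q=1}^{k} \alpha^q_{j(q)}\right) M_{j(1)}(1) \circ M_{j(2)}(2) \circ \ldots \circ M_{j(k)}(k) \in \Delta(\widetilde{\mathcal{F}})$$

since $0 \leq \prod_{q=1}^{k} \alpha^q_{j(q)} \leq 1$ and

$$\sum_{j(1)=1}^{l(1)} \sum_{j(2)=1}^{l(2)} \cdots \sum_{j(k)=1}^{l(k)} \prod_{q=1}^{k} \alpha^q_{j(q)} = \left(\sum_{j=1}^{l(1)} \alpha^1_j\right) \left(\sum_{j=1}^{l(2)} \alpha^2_j\right) \cdots \left(\sum_{j=1}^{l(k)} \alpha^k_j\right) = 1$$

from equation 59 so that the desired conclusion follows at once. □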
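The expansion used in this proof can be verified numerically on a small instance. The sketch below is illustrative only: the matrices `A` and `B`, the weights in `alphas` and the helper names are hypothetical. It composes $k = 3$ convex combinations of the family and compares the result with the explicitly expanded convex combination of all $k$-fold products, whose weights are the products of the individual coefficients.

```python
from itertools import product
from math import prod

def matmul(A, B):
    """Matrix product A o B."""
    n = len(A)
    return [[sum(A[i][j] * B[j][l] for j in range(n)) for l in range(n)]
            for i in range(n)]

def combo(weights, mats):
    """Entrywise convex combination sum_j weights[j] * mats[j]."""
    n = len(mats[0])
    return [[sum(w * M[r][c] for w, M in zip(weights, mats))
             for c in range(n)] for r in range(n)]

def compose_all(mats):
    """Left-to-right composition mats[0] o mats[1] o ... o mats[-1]."""
    T = mats[0]
    for M in mats[1:]:
        T = matmul(T, M)
    return T

# Hypothetical family of two column-stochastic matrices.
A = [[0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
family = [A, B]

# alphas[i][j] is the weight of family[j] in the i-th factor T_{i+1} (each row sums to 1).
alphas = [[0.3, 0.7], [0.6, 0.4], [0.5, 0.5]]
k = len(alphas)

# Left-hand side: compose the k convex combinations directly.
lhs = compose_all([combo(a, family) for a in alphas])

# Right-hand side: convex combination of all k-fold products with product weights.
n = len(A)
rhs = [[0.0] * n for _ in range(n)]
total_weight = 0.0
for idx in product(range(len(family)), repeat=k):
    w = prod(alphas[i][idx[i]] for i in range(k))
    L = compose_all([family[j] for j in idx])
    total_weight += w
    for r in range(n):
        for c in range(n):
            rhs[r][c] += w * L[r][c]

max_gap = max(abs(lhs[r][c] - rhs[r][c]) for r in range(n) for c in range(n))
print(max_gap, total_weight)
```

Up to floating-point rounding, `max_gap` is zero and `total_weight` is one, which matches the two facts the proof relies on.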

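Now continue with equation 56 so that we can write

$$\begin{aligned}
\forall\, t \in \mathbb{N} \text{ we have } P(X_t = \cdot) &= T_{t-1} \circ \ldots \circ T_1 \circ T_0 \cdot P(X_0 = \cdot) \\
&= \underbrace{T_{t-1} \circ \ldots \circ T_{m k + 1} \circ T_{m k}}_{r\text{-fold composition}} \circ \underbrace{T_{m k - 1} \circ \ldots \circ T_{(m-1) k + 1} \circ T_{(m-1) k}}_{k\text{-fold composition}} \circ \ldots \\
&\quad \ldots \circ \underbrace{T_{2k-1} \circ \ldots \circ T_{k+1} \circ T_k}_{k\text{-fold composition}} \circ \underbrace{T_{k-1} \circ \ldots \circ T_1 \circ T_0}_{k\text{-fold composition}} \cdot P(X_0 = \cdot) \\
&= T_{t-1} \circ \ldots \circ T_{m k + 1} \circ T_{m k} \circ \widetilde{T}_{m-1} \circ \widetilde{T}_{m-2} \circ \ldots \circ \widetilde{T}_1 \circ \widetilde{T}_0 \cdot P(X_0 = \cdot)
\end{aligned} \tag{60}$$

where $m = \lfloor t / k \rfloor$ and $r < k$ is the remainder after dividing $t$ by $k$, and each
$\widetilde{T}_j = T_{(j+1) k - 1} \circ \ldots \circ T_{j k} \in \widetilde{\Delta(\mathcal{F})} \subseteq \Delta(\widetilde{\mathcal{F}})$ thanks to lemma 80. Since $\Delta(\widetilde{\mathcal{F}})$ is an
equi-contraction family (see equation 55 and the discussion which follows this equation), it
follows immediately that we can find a constant $\alpha \in [0, 1)$ such that

$$\|\widetilde{T}_{m-1} \circ \widetilde{T}_{m-2} \circ \ldots \circ \widetilde{T}_1 \circ \widetilde{T}_0 \cdot P(X_0 = \cdot) - \pi\|_{L_1} \leq \alpha^m \cdot \|P(X_0 = \cdot) - \pi\|_{L_1}.$$

Furthermore, according to equation 44 which concludes the first part of the proof of
theorem 72, and since $\pi$ is a stationary distribution of every $T_i$, we also have

$$\begin{aligned}
&\|T_{t-1} \circ \ldots \circ T_{m k + 1} \circ T_{m k} \circ \widetilde{T}_{m-1} \circ \widetilde{T}_{m-2} \circ \ldots \circ \widetilde{T}_1 \circ \widetilde{T}_0 \cdot P(X_0 = \cdot) - \pi\|_{L_1} \\
&\quad = \|(T_{t-1} \circ \ldots \circ T_{m k + 1} \circ T_{m k}) \circ (\widetilde{T}_{m-1} \circ \widetilde{T}_{m-2} \circ \ldots \circ \widetilde{T}_1 \circ \widetilde{T}_0 \cdot P(X_0 = \cdot) - \pi)\|_{L_1} \\
&\quad \leq \|\widetilde{T}_{m-1} \circ \widetilde{T}_{m-2} \circ \ldots \circ \widetilde{T}_1 \circ \widetilde{T}_0 \cdot P(X_0 = \cdot) - \pi\|_{L_1} \leq \alpha^m \cdot \|P(X_0 = \cdot) - \pi\|_{L_1}.
\end{aligned}$$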

The observations above lead to the following extension of theorem 77.

Theorem 81. Consider any finite set $\mathcal{X}$. Suppose $\mathcal{F}$ is a finite irreducible and
aperiodic family with a common reachable index $k$ and all the matrices in $\mathcal{F}$ have a
common stationary distribution $\pi$. Now consider any stochastic process $\{Z_n\}_{n=0}^{\infty}$ with each
$Z_n = (F_n, X_n)$ on $\mathcal{F} \times \mathcal{X}$ having the following properties:

$F_0$ and $X_0$ are independent random variables. (61)

For $n \geq 1$, $F_n$ does not depend on $X_n, X_{n+1}, \ldots$ (however, it may depend on
$X_0, X_1, \ldots, X_{n-1}$ as well as many other implicit parameters). (62)

The stochastic process $X_n$ is a non-homogenous time Markov chain on $\mathcal{X}$ with transition
matrices $F_n(\omega)$. More explicitly,

if $F_k(\omega) = \{p^{k}_{x \to y}\}_{x, y \in \mathcal{X}}$ then $\forall\, y \in \mathcal{X}$ we have

$$P(X_n = y) = \sum_{x \in \mathcal{X}} P(X_{n-1} = x)\, p^{n-1}_{x \to y}. \tag{63}$$

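Then the non-homogenous time Markov chain converges to the unique stationary
distribution $\pi$ exponentially fast regardless of the initial distribution of $X_0$. More precisely,
$\exists\, \alpha \in (0, 1)$ such that $\forall\, t \in \mathbb{N}$ we have

$$\|P(X_t = \cdot) - \pi\|_{L_1} \leq \alpha^{m(t)} \cdot \|P(X_0 = \cdot) - \pi\|_{L_1}$$

where $P(X_t = \cdot)$ denotes the probability distribution of the random variable $X_t$ and
$m(t) = \lfloor t / k \rfloor$.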
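The rate in theorem 81 can be illustrated on a toy example. The sketch below rests on several assumptions that are not taken from the paper: the family `[A, B]` with common reachable index $k = 2$ is hypothetical, the weights standing in for $P(F_t = M)$ are drawn arbitrarily, and the constant `alpha` is computed as the largest Dobrushin contraction coefficient over all 2-fold products, which is one standard way to obtain an $L_1$ contraction constant and serves here as a stand-in for the equi-contraction constant.

```python
import random
from itertools import product

def matmul(A, B):
    """Matrix product A o B."""
    n = len(A)
    return [[sum(A[i][j] * B[j][l] for j in range(n)) for l in range(n)]
            for i in range(n)]

def mix(matrices, weights):
    """Entrywise convex combination sum_i weights[i] * matrices[i]."""
    n = len(matrices[0])
    return [[sum(w * M[y][x] for w, M in zip(weights, matrices))
             for x in range(n)] for y in range(n)]

def apply_matrix(T, p):
    """Column-vector convention: (T p)[y] = sum_x T[y][x] * p[x]."""
    return [sum(T[y][x] * p[x] for x in range(len(p))) for y in range(len(T))]

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def contraction_coefficient(T):
    """Dobrushin coefficient: half the largest L1 distance between two columns of T."""
    n = len(T)
    return max(0.5 * sum(abs(T[y][a] - T[y][b]) for y in range(n))
               for a in range(n) for b in range(n))

# Hypothetical family with zero entries; every 2-fold product is strictly positive.
A = [[0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
family = [A, B]
pi = [1 / 3, 1 / 3, 1 / 3]  # both matrices are doubly stochastic
k = 2                        # common reachable index of this family

# Worst contraction coefficient over the family of all 2-fold products.
alpha = max(contraction_coefficient(matmul(M1, M2))
            for M1, M2 in product(family, repeat=k))

rng = random.Random(1)
p = [1.0, 0.0, 0.0]
d0 = l1_distance(p, pi)
bound_holds = True
for t in range(1, 21):
    w = rng.random()                      # arbitrary stand-in for P(F_{t-1} = A)
    p = apply_matrix(mix(family, [w, 1 - w]), p)
    bound = alpha ** (t // k) * d0        # alpha^{m(t)} * ||P(X_0 = .) - pi||
    bound_holds = bound_holds and l1_distance(p, pi) <= bound + 1e-12

print(alpha, bound_holds, l1_distance(p, pi))
```

The exponent `t // k` corresponds to $m(t) = \lfloor t / k \rfloor$: a guaranteed contraction is gained only once per block of $k$ consecutive steps, while the remaining $r < k$ steps are merely non-expanding.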

7. Conclusions and Upcoming Work 

This is the first in a series of papers leading to the development and applications of very
promising and novel Monte Carlo sampling techniques for reinforcement learning in the
setting of POMDPs (partially observable Markov decision processes). In this work we have
established a version of a Geiringer-like theorem with non-homologous recombination that is
well suited for the development of dynamic programming Monte Carlo search algorithms to
cope with randomness and incomplete information. More explicitly, the theorem provides
an insight into how one may take full advantage of a sample of seemingly independent
rollouts by exploiting symmetries within the space of observations as well as additional
similarities that may be provided as expert knowledge. Due to space limitations the actual
algorithms will appear in the upcoming works. Additionally, the general finite-population
Geiringer theorem appearing in the PhD thesis of the first author as well as in fOl and llT2l 
has been further strengthened with the aim of amplifying the reasons why the above ideas
are highly promising in applications, not to mention its mathematical importance.